Architecting for Analytics

Before building an analytics system, IT departments must consider these key issues.

To get the payoff from Big Data, organizations have to make a lot of decisions. Intel has seen many approaches to building an analytics ‘stack’ and their architectural implications. Here are some factors that can make your project a success.

IT executives need to decide how far data should travel before it’s cleaned up and analyzed. The two most practical choices each have strengths and drawbacks.

One option is a data lake, which holds raw data in its native formats so analysts can get at everything without waiting for it to be cleaned or moved first. On the other hand, having to sift through raw data can slow down analysis, and data lakes inevitably store data that ultimately isn’t needed.

For Patricia Florissi, global chief technology officer for sales and distinguished engineer at EMC, the pros outweigh the cons.

“You should be able to do analytics without moving the data,” she says.

In its data lake solutions, EMC stores raw data from different sources in multiple formats. That approach gives analysts access to more information and lets them discover things that might be lost if the data were cleaned first or some of it were thrown away.

Florissi adds that big analytics efforts might require multiple data lakes.

Media conglomerate AOL also uses data lakes, says James LaPlaine, the company’s chief information officer. The company handles billions of transactions per day, and “the time it takes to copy huge data sets is a problem,” he says. Leaving data in native formats and moving it from the point of capture directly to the public cloud avoids the cost of copying it over the internal network.

We want all of our rich data in one place so we can have a single source of truth across the company.

Mike Bojdak, Sr. Technology Director at AOL

What Kind of Database to Use

It’s important to choose the right database for an analytics project, with factors like data quantity, formatting and latency all playing a role.

The project where Intel switched databases involved an advanced query “using data from a bunch of noncorrelated sources,” Safa says. Running on an SQL database, the query took four hours. On an in-memory database, the same query took 10 minutes. But he notes that doesn’t make in-memory the right choice for every application. It always comes back to the business goals for the task at hand.
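The article doesn’t name the engines involved, so as a rough stand-in, the sketch below uses SQLite’s in-memory mode to show the general pattern: pull the noncorrelated sources into RAM once, then run the cross-source query there rather than against disk. The tables and columns are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # the whole database lives in RAM

con.executescript("""
    CREATE TABLE shipments (part_id TEXT, qty INTEGER, region TEXT);
    CREATE TABLE defects   (part_id TEXT, defect_count INTEGER);
""")
con.executemany("INSERT INTO shipments VALUES (?, ?, ?)",
                [("p1", 100, "EMEA"), ("p2", 250, "APAC"), ("p1", 50, "APAC")])
con.executemany("INSERT INTO defects VALUES (?, ?)",
                [("p1", 3), ("p2", 12)])

# The cross-source join and aggregation run entirely against memory.
for row in con.execute("""
        SELECT s.region, SUM(s.qty) AS shipped, SUM(d.defect_count) AS defects
        FROM shipments s JOIN defects d USING (part_id)
        GROUP BY s.region"""):
    print(row)
```

At production volumes the same pattern calls for a purpose-built in-memory platform rather than SQLite, but the principle is the same: the working set sits in RAM, so the join never waits on disk.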

As a starting point, Safa says, consider whether a project is looking for patterns or requires pinpoint accuracy.

Distributed platforms such as Hadoop, which can store data in many different formats, work well for projects focused on finding trends, he says. In these cases, a few inaccurate data points won’t materially change the result.
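A toy illustration of that tolerance, with made-up numbers: fit a trend line to sixty days of visit counts, zero out a few readings, and the fitted slope barely moves.

```python
import random

random.seed(7)
# Sixty days of page-visit counts with an upward trend of about 25 visits/day.
daily_visits = [1000 + day * 25 + random.randint(-40, 40) for day in range(60)]

# Corrupt a few readings, as can happen in a loosely governed lake.
dirty = list(daily_visits)
for i in (5, 23, 47):
    dirty[i] = 0

def slope(ys):
    """Ordinary least-squares slope of ys against day number."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

print(slope(daily_visits))  # roughly 25 visits/day
print(slope(dirty))         # nearly the same despite the bad points
```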

On the other hand, he says, “If you are trying to determine where specific materials are at a given moment in your manufacturing process, then you need 100 percent accuracy with no latency.”

That requires a database with more structure or controls, tuned for real-time results. Depending on its specific needs, a company might choose an in-memory data-processing framework or a performance-focused NoSQL database. Although many analytics database types have overlapping capabilities, their features are materially different.
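As a sketch of that second case, the snippet below uses Redis, one example of a performance-focused key-value store and not necessarily what any company quoted here runs, to track where a specific material lot is at any given moment. The keys and fields are invented.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Update the record the instant a lot moves to a new station.
r.hset("lot:A1234", mapping={
    "station": "etch-03",
    "updated": "2016-05-02T10:42:00Z",
})

# A dashboard reads the current location directly, with no batch pipeline in between.
print(r.hgetall("lot:A1234"))
```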

Data classification is ... labor-intensive, but it’s an important thing to get right.

James LaPlaine, Chief Information Officer at AOL

How to Control Access

In securing big data, IT departments face a familiar tradeoff between preventing inappropriate access and providing adequate access.

Brian Hopkins, vice president and principal analyst at Forrester Research, recommends controlling access via standard perimeter authentication and authorization mechanisms, such as passwords or multi-factor authentication. But companies should also encrypt data and restrict the sharing of data via tokenization, he says.
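Tokenization is simple to picture: sensitive values are swapped for meaningless tokens before data is shared, and the mapping back to the real values lives in a separately secured vault. The sketch below is purely illustrative, with made-up field names, not a production design.

```python
import secrets

# Token-to-value mapping; in practice this would be a locked-down vault service.
vault = {}

def tokenize(value: str) -> str:
    """Replace a sensitive value with a meaningless, random token."""
    token = "tok_" + secrets.token_hex(8)
    vault[token] = value
    return token

record = {"customer_email": "jane@example.com", "order_total": 129.95}
shared = dict(record, customer_email=tokenize(record["customer_email"]))

print(shared)  # safe to hand to a wider group of analysts
# Only whoever controls the vault can map a token back:
# original = vault[shared["customer_email"]]
```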

Other safeguards include keeping in place the access privileges from the system the data came from and restricting access to analyzed results to the person or team that performed the analysis.

Although AOL is aiming to put all of its rich data in a centralized cloud, it has access controls in place at multiple levels.

An analyst manually reviews data and sets a level of access based on its sensitivity; an authentication system ensures that only people granted that level of access can view the data.
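In code, that kind of classification-based check can be as simple as comparing a dataset’s sensitivity level against the level granted to the requesting user. The sketch below is a generic illustration with invented levels, datasets and users, not AOL’s actual system.

```python
# Sensitivity levels an analyst might assign, least to most sensitive.
SENSITIVITY = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

# Classification assigned to each dataset during manual review.
datasets = {"ad_impressions": "internal", "subscriber_pii": "restricted"}

# Level granted to each user by the authentication system.
clearance = {"analyst_a": "internal", "privacy_team": "restricted"}

def can_view(user: str, dataset: str) -> bool:
    return SENSITIVITY[clearance[user]] >= SENSITIVITY[datasets[dataset]]

print(can_view("analyst_a", "ad_impressions"))   # True
print(can_view("analyst_a", "subscriber_pii"))   # False
```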

AOL constantly reviews data to make sure it carries the correct access classifications for the authentication system, LaPlaine says. “Data classification is a manual process,” he adds. “It’s labor-intensive, but it’s an important thing to get right.”

“We’re trying to balance meeting the needs of analysts and making sure the data is completely secure,” adds Bojdak.
