Big Data For Dummies

Traditional business intelligence products weren’t really designed to handle big data, so they may require some modification. They were designed to work with highly structured, well-understood data, often stored in a relational data repository and displayed on your desktop or laptop computer. This traditional business intelligence analysis is typically applied to snapshots of data rather than the entire amount of data available. What’s different with big data analysis?

Characteristics of big data

Big data consists of structured, semi-structured, and unstructured data. You often have a lot of it, and it can be quite complex. When you think about analyzing it, you need to be aware of the potential characteristics of your data:

  • It can come from untrusted sources. Big data analysis often involves aggregating data from various sources. These may include both internal and external data sources. How trustworthy are these external sources of information? For example, how trustworthy is social media data like a tweet? The information may be coming from an unverified source. The integrity of this data needs to be considered in the analysis.

  • It can be dirty. Dirty data is data that is inaccurate, incomplete, or erroneous. Examples include misspelled words; readings from a sensor that is broken, poorly calibrated, or otherwise corrupted; and duplicated records. Data scientists debate where to clean the data: close to the source or in real time.

    Of course, one school of thought says that the dirty data should not be cleaned at all because it may contain interesting outliers. The cleansing strategy will probably depend on the source and type of data and the goal of your analysis. For example, if you’re developing a spam filter, the goal is to detect the bad elements in the data, so you would not want to clean it.

  • The signal-to-noise ratio can be low. In other words, the signal (usable information) may be only a tiny percentage of the data; the rest is noise. Being able to extract a tiny signal from noisy data is part of the benefit of big data analytics, but you need to be aware that the signal may indeed be small.

  • It can be real-time. In many cases, you’ll be trying to analyze real-time data streams.
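The dirty-data and signal-to-noise points above can be made concrete with a small sketch. It assumes hypothetical sensor records with `sensor_id`, `timestamp`, and `value` fields and an arbitrary valid range; none of these names or thresholds come from the book. Rather than deleting suspect records, it labels them, so that potentially interesting outliers stay available for later analysis.

```python
from collections import Counter

def flag_dirty(records, valid_range=(-50.0, 60.0)):
    """Label each record instead of deleting it, so outliers are preserved."""
    low, high = valid_range          # assumed plausible range, for illustration only
    seen = set()
    flagged = []
    for rec in records:
        key = (rec.get("sensor_id"), rec.get("timestamp"))
        value = rec.get("value")
        if value is None:
            status = "missing"       # incomplete record
        elif key in seen:
            status = "duplicate"     # same sensor and timestamp seen before
        else:
            seen.add(key)
            status = "ok" if low <= value <= high else "out_of_range"
        flagged.append({**rec, "status": status})
    return flagged

readings = [
    {"sensor_id": "a", "timestamp": 1, "value": 21.5},
    {"sensor_id": "a", "timestamp": 1, "value": 21.5},   # repeated record
    {"sensor_id": "b", "timestamp": 1, "value": None},   # incomplete
    {"sensor_id": "c", "timestamp": 1, "value": 999.0},  # possibly miscalibrated
    {"sensor_id": "a", "timestamp": 2, "value": 22.0},
]

flagged = flag_dirty(readings)
counts = Counter(r["status"] for r in flagged)
usable_fraction = counts["ok"] / len(flagged)  # a rough stand-in for "signal"
print(counts, usable_fraction)
```

Whether the flagged records are later dropped, repaired, or studied in their own right depends on the goal of the analysis, as in the spam filter example.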

Big data governance is going to be an important part of the analytics equation. Underneath business analytics, governance solutions will need to be enhanced to ensure the veracity of the data coming from new sources, especially as that data is combined with existing trusted data stored in a warehouse. Data security and privacy solutions also need to be enhanced to support managing and governing big data stored within new technologies.

Analytical big data algorithms

When you’re considering big data analytics, you need to be aware that when you expand beyond the desktop, the algorithms you use often need to be refactored; that is, their internal code must change without affecting their external behavior. The beauty of a big data infrastructure is that a model that used to take hours or days to run can finish in minutes.

This lets you iterate on the model hundreds of times over. However, if you’re running a regression on a billion rows of data across a distributed environment, you need to consider the resource requirements relating to the volume of data and its location in the cluster. Your algorithms need to be data aware.
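One way to picture a "data aware" regression is to summarize each partition where it lives and move only the small summaries across the cluster. The sketch below is an illustration under assumptions, not a method described in the book: it fits a simple one-variable least-squares line, and the function names (`partition_stats`, `merge`, `fit_line`) and the sample partitions are hypothetical.

```python
from functools import reduce

def partition_stats(rows):
    """Summarize one partition locally as (n, sum x, sum y, sum xy, sum x^2)."""
    n = sx = sy = sxy = sxx = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    return (n, sx, sy, sxy, sxx)

def merge(a, b):
    """Combine two partition summaries; only these five numbers travel."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(stats):
    """Closed-form simple linear regression from the merged summary."""
    n, sx, sy, sxy, sxx = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Three hypothetical partitions, as if stored on different nodes.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]

total = reduce(merge, (partition_stats(p) for p in partitions))
slope, intercept = fit_line(total)
print(slope, intercept)
```

The result matches what a single-machine fit over all rows would give, but the amount of data shipped between nodes depends on the number of partitions rather than the number of rows.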

Additionally, vendors are starting to offer new analytics designed to be placed close to the big data sources so that data can be analyzed in place. Running analytics closer to the data sources minimizes the amount of stored data by retaining only the high-value data. It also enables you to analyze the data sooner, which is critical for real-time decision making.
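As a rough illustration of retaining only high-value data near the source, the sketch below forwards a reading only when it differs noticeably from the last one it kept. This "report by exception" filter is one simple technique chosen for illustration; the function name, the `deadband` threshold, and the sample values are assumptions, and the vendor products mentioned above may work quite differently.

```python
def report_by_exception(readings, deadband=0.5):
    """Yield a reading only when it moves more than `deadband` from the last kept one."""
    last_kept = None
    for value in readings:
        if last_kept is None or abs(value - last_kept) > deadband:
            last_kept = value
            yield value

# Hypothetical temperature readings arriving at the edge device.
raw = [20.0, 20.1, 20.2, 21.0, 21.1, 25.0, 25.2, 20.0]
kept = list(report_by_exception(raw))
print(f"kept {len(kept)} of {len(raw)} readings: {kept}")
```

Because the filter is a generator, it handles one reading at a time and never needs to hold the full stream in memory.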

Of course, analytics will continue to evolve. For example, you may need real-time visualization capabilities to display data that is continuously changing. How do you practically plot a billion points on a graph? Or how do you tune predictive algorithms so that their analysis is fast enough and deep enough to make use of an ever-expanding, complex data set? This is an area of active research.
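One common workaround for the billion-point question is to plot counts per grid cell rather than individual points. The following is a minimal sketch of that binning idea, with a hypothetical `bin_points` function and made-up sample data; it is not a technique the book prescribes.

```python
from collections import Counter

def bin_points(points, x_range, y_range, width, height):
    """Count points per grid cell; memory grows with the grid, not the point count."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted area
        # Scale to a cell index; clamp so the maximum value lands in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[(row, col)] += 1
    return grid

points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (1.0, 1.0), (0.6, 0.2)]
grid = bin_points(points, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
print(dict(grid))
```

The `points` argument can be any iterable, including a stream, so the grid can be updated as new data arrives and redrawn as a heat map.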

Big data infrastructure support

Suffice it to say that if you’re looking for a platform, it needs to achieve the following:

  • Integrate technologies: The infrastructure needs to integrate new big data technologies with traditional technologies to be able to process all kinds of big data and make it consumable by traditional analytics.

  • Store large amounts of disparate data: An enterprise-hardened Hadoop system may be needed that can process/store/manage large amounts of data at rest, whether it is structured, semi-structured, or unstructured.

  • Process data in motion: A stream-computing capability may be needed to process data in motion that is continuously generated by sensors, smart devices, video, audio, and logs to support real-time decision making.

  • Warehouse data: You may need a solution optimized for operational or deep analytical workloads to store and manage the growing amounts of trusted data.

And of course, you need the capability to integrate the data you already have in place along with the results of the big data analysis.
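To give a feel for the "process data in motion" requirement listed above, here is a minimal sketch of a sliding-window average computed as values arrive. A real stream-computing platform does far more than this (distribution, fault tolerance, back pressure); the function name, window size, and sample values are assumptions for illustration.

```python
from collections import deque

def windowed_mean(stream, window=3):
    """Yield the mean of the most recent `window` values as each one arrives."""
    recent = deque(maxlen=window)
    total = 0.0
    for value in stream:
        if len(recent) == window:
            total -= recent[0]   # the oldest value is about to be evicted
        recent.append(value)
        total += value
        yield total / len(recent)

# Hypothetical sensor readings arriving one at a time.
means = list(windowed_mean([10.0, 20.0, 30.0, 40.0, 50.0], window=3))
print(means)
```

Only the last few values are kept in memory, so the same loop works whether the stream has five readings or never ends.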

About This Article

This article is from the book Big Data For Dummies.

About the book authors:

Judith Hurwitz is an expert in cloud computing, information management, and business strategy. Alan Nugent has extensive experience in cloud-based big data solutions. Dr. Fern Halper specializes in big data and analytics. Marcia Kaufman specializes in cloud infrastructure, information management, and analytics.
