How to Use Data Streaming for Big Data

By Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman

Sometimes, when approaching big data, companies are faced with huge amounts of data and little idea of where to go next. Enter data streaming. When a significant amount of data needs to be quickly processed in near real time to gain insights, data in motion in the form of streaming data is the best answer.

What is data that is not at rest? It is the data in systems that are managing active transactions and therefore need persistence. In these cases, the data is stored in an operational data store. In other situations, those transactions have already been executed, and it is time to analyze that data, typically in a data warehouse or data mart.

This means that the information is being processed in batch and not in real time. When organizations are planning for their future, they need to be able to analyze lots of data, including information about what customers are buying and why. It is important to understand the leading indicators of change. In other words, how will changes impact what products and services an organization will offer in the future?

Many research organizations are using this type of big data analytics to discover new medicines. An insurance company may want to compare the patterns of traffic accidents across a broad geographic area with weather statistics. In these cases, there is no benefit to managing this information at real-time speed. The analysis still has to be fast and practical. In addition, organizations will analyze the data to see whether new patterns emerge.

Streaming data calls for an analytic computing platform that is focused on speed, because these applications must process a continuous stream of often unstructured data. Therefore, data is continuously analyzed and transformed in memory before it is stored on a disk. Stream processing works by analyzing “time windows” of data in memory across a cluster of servers.
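To make the idea of a time window concrete, here is a minimal single-machine sketch in Python. It is an illustration only, not how any particular streaming product works: the function name, the (timestamp, key) event shape, and the assumption that events arrive in timestamp order are all hypothetical. Only the current window is held in memory, and each window is emitted as soon as it closes.

```python
def tumbling_windows(events, window_seconds):
    """Yield (window_start, counts) for fixed, non-overlapping time windows.

    Assumes (hypothetically) that `events` is an iterable of
    (timestamp_in_seconds, key) pairs arriving in timestamp order.
    """
    current_start = None
    counts = {}
    for timestamp, key in events:
        # Map the event to the start of the window it falls in.
        start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit it and free its memory.
            yield current_start, counts
            current_start, counts = start, {}
        counts[key] = counts.get(key, 0) + 1
    if current_start is not None:
        # Emit the final, possibly partial, window.
        yield current_start, counts


sample_events = [(0, "login"), (3, "click"), (9, "click"), (10, "login"), (14, "click")]
for window_start, window_counts in tumbling_windows(sample_events, window_seconds=10):
    print(window_start, window_counts)
```

In a real cluster, the windows would be partitioned across servers; this sketch shows only the per-window logic.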

This is similar to the approach used to manage data at rest with Hadoop. The primary difference is the issue of velocity. In the Hadoop cluster, data is collected in batch mode and then processed. Speed matters less in Hadoop than it does in data streaming. Some key principles define when using streams is most appropriate:

  • Determining a retail buying opportunity at the point of engagement, either via social media or via permission-based messaging

  • Collecting information about movement around a secure site

  • Reacting to an event that needs an immediate response, such as a service outage or a change in a patient’s medical condition

  • Calculating, in real time, costs that depend on variables such as usage and available resources

Streaming data is useful when analytics need to be done in real time while the data is in motion. In fact, the value of the analysis (and often the data) decreases with time. For example, if you can’t analyze and act immediately, a sales opportunity might be lost or a threat might go undetected.

The following are some examples that can help explain how this is useful.

A power plant needs to be a highly secure environment so that unauthorized individuals do not interfere with the delivery of power to customers. Companies often place sensors around the perimeter of a site to detect movement. But a problem could exist. A huge difference exists between a rabbit that scurries around the site and a car driving by quickly and deliberately. Therefore, the vast amount of data coming from these sensors needs to be analyzed in real time so that an alarm is sounded only when an actual threat exists.
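The filtering step in the power plant example might look something like the following Python sketch. Everything here is an assumption made for illustration: the Reading fields, the sensor names, and the size and speed thresholds are hypothetical, and a real perimeter system would use far richer signals than two numbers.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One hypothetical perimeter-sensor reading."""
    sensor_id: str
    size_m: float      # estimated size of the moving object, in meters
    speed_mps: float   # estimated speed, in meters per second


def is_threat(reading, min_size_m=1.0, min_speed_mps=5.0):
    # Illustrative thresholds only: an object must be both large and
    # fast to count as a threat, so a small animal is ignored.
    return reading.size_m >= min_size_m and reading.speed_mps >= min_speed_mps


def alarms(readings):
    """Yield only the readings that should sound an alarm, one at a time."""
    for reading in readings:
        if is_threat(reading):
            yield reading


sample = [
    Reading("north-1", size_m=0.3, speed_mps=4.0),   # rabbit-sized
    Reading("gate-2", size_m=4.5, speed_mps=15.0),   # car-sized and fast
]
for alert in alarms(sample):
    print("ALARM from", alert.sensor_id)
```

Because each reading is judged as it arrives, nothing has to be stored before the alarm decision is made.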

A telecommunications company in a highly competitive market wants to make sure that outages are carefully monitored so that a detected drop in service levels can be escalated to the appropriate group. Communications systems generate huge volumes of data that have to be analyzed in real time to take the appropriate action. A delay in detecting an error can seriously impact customer satisfaction.
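One simple way to detect a drop in service levels is to keep a moving average over the most recent measurements and escalate when it falls below a target. The Python sketch below assumes a hypothetical feed of call-success rates between 0 and 1; the window size and threshold are illustrative values, not industry standards.

```python
from collections import deque


def detect_service_drops(success_rates, window_size=3, threshold=0.9):
    """Yield (index, moving_average) whenever the average of the last
    `window_size` samples falls below `threshold`.

    `success_rates` is a hypothetical stream of values between 0 and 1.
    """
    window = deque(maxlen=window_size)  # oldest sample drops out automatically
    for index, rate in enumerate(success_rates):
        window.append(rate)
        if len(window) == window_size:
            average = sum(window) / window_size
            if average < threshold:
                yield index, average


samples = [0.99, 0.99, 0.98, 0.80, 0.70, 0.99]
for index, average in detect_service_drops(samples):
    print(f"escalate at sample {index}: moving average {average:.2f}")
```

Averaging over a window keeps a single bad sample from triggering an escalation, at the cost of a short delay before a real outage is flagged.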

Needless to say, businesses are dealing with a lot of data that needs to be processed and analyzed in real time. Therefore, the physical environment that supports this level of responsiveness is critical. Streaming data environments typically require a clustered hardware solution, and sometimes a massively parallel processing approach will be required to handle the analysis.

One important characteristic of streaming data analysis is that it is single-pass. In other words, the analyst cannot reanalyze the data after it is streamed. This is common in applications where you are looking for the absence of data.
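As a rough illustration of looking for the absence of data in a single pass, the Python sketch below scans a stream of event timestamps once and reports any silence longer than a chosen limit. The function name and the gap limit are hypothetical. Note one simplification: the silence is reported only when the next event finally arrives, whereas a live system would typically also use a timer to raise the alert while the silence is still in progress.

```python
def find_silences(timestamps, max_gap_seconds):
    """Yield (last_seen, next_seen) for every gap longer than max_gap_seconds.

    Single pass: only the previous timestamp is remembered, so earlier
    data cannot be revisited. Assumes timestamps arrive in order.
    """
    previous = None
    for current in timestamps:
        if previous is not None and current - previous > max_gap_seconds:
            yield previous, current
        previous = current


heartbeats = [0, 5, 10, 40, 45]  # seconds; a device went quiet after t=10
for last_seen, next_seen in find_silences(heartbeats, max_gap_seconds=10):
    print(f"no data between t={last_seen} and t={next_seen}")
```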

If several passes are required, the data will have to be put into some sort of warehouse where additional analysis can be performed. For example, it is often necessary to establish context. How does this streaming data compare to historical data? This correlation can tell you a lot about what has changed and what that change might mean to your business.
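A minimal sketch of that comparison, in Python: summarize the historical data once, then score each incoming value against the summary. The sample numbers, the function name, and the three-standard-deviation cutoff are assumptions chosen for illustration; in practice the historical values would come from the warehouse.

```python
import statistics


def flag_unusual(stream, historical, max_deviations=3.0):
    """Yield streaming values that sit more than `max_deviations` sample
    standard deviations away from the historical mean.

    `historical` stands in for values read from a warehouse; it needs at
    least two points with some variation for the comparison to be meaningful.
    """
    mean = statistics.mean(historical)
    spread = statistics.stdev(historical)
    for value in stream:
        if spread > 0 and abs(value - mean) / spread > max_deviations:
            yield value


historical_readings = [100, 102, 98, 101, 99]
incoming = [101, 110, 99]
print(list(flag_unusual(incoming, historical_readings)))
```

The historical summary supplies the context, while the streaming values are still examined only once as they arrive.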