Big Data: The Need for Metadata in Data Streams

Statistics for Big Data For Dummies

Most big data management professionals are familiar with the need to manage metadata in structured database management environments. These data sources are strongly typed (for example, the first ten characters are the first name) and designed to operate with metadata. You might assume that metadata is nonexistent in unstructured data, but that is not true.
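To make the "first ten characters are the first name" idea concrete, here is a minimal sketch of a fixed-width record whose layout acts as its metadata. The field names, widths, and sample record are assumptions chosen for illustration; only the ten-character first name comes from the text above.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    start: int  # inclusive offset into the record
    end: int    # exclusive offset into the record


# Hypothetical layout: this list is the metadata that describes each record.
LAYOUT = [
    Field("first_name", 0, 10),   # the first ten characters are the first name
    Field("last_name", 10, 25),   # assumed width, for illustration only
    Field("zip_code", 25, 30),    # assumed width, for illustration only
]


def parse_record(line: str, layout=LAYOUT) -> dict:
    """Slice a fixed-width record into named fields using the layout."""
    return {f.name: line[f.start:f.end].strip() for f in layout}


# Build a sample record padded to the assumed field widths.
record = "Ada".ljust(10) + "Lovelace".ljust(15) + "02139"
parsed = parse_record(record)
print(parsed)
```

Without the layout, the record is just a 30-character string. The metadata is what turns it into a first name, a last name, and a ZIP code.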

Typically you find structure in any kind of data. Take the example of video. Although you might not know the exact content of a specific video, the format of that video data contains a lot of structure. If you are looking at unstructured text, you may know, for example, that the words are written in English and that you can interpret the text if you apply the right tools.

Because unstructured data carries this implicit metadata, it is possible to parse the information using eXtensible Markup Language (XML). XML is a technique for presenting unstructured text files with meaningful tags. The underlying technology is not new; it was one of the foundational technologies for implementing service orientation.
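As a rough illustration of tagging unstructured text, the sketch below wraps a free-text message in XML using Python's standard `xml.etree.ElementTree` module and then reads the metadata back. The element and attribute names (`document`, `body`, `wordCount`, `language`, `source`) and the sample message are assumptions made up for this example, not part of any standard schema.

```python
import xml.etree.ElementTree as ET


def wrap_text(text: str, language: str, source: str) -> str:
    """Wrap free text in XML tags that make its implicit metadata explicit."""
    # Hypothetical tag and attribute names, chosen only for illustration.
    doc = ET.Element("document", attrib={"language": language, "source": source})
    body = ET.SubElement(doc, "body")
    body.text = text
    count = ET.SubElement(doc, "wordCount")
    count.text = str(len(text.split()))
    return ET.tostring(doc, encoding="unicode")


xml_string = wrap_text("Shipment delayed at the port until Friday.", "en", "email")

# A downstream tool can read the metadata without interpreting the text itself.
parsed = ET.fromstring(xml_string)
language = parsed.get("language")
word_count = int(parsed.findtext("wordCount"))
print(language, word_count)
```

The message body stays unstructured, but the tags around it give other programs something predictable to query.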

Examples of products for streaming data include IBM’s InfoSphere Streams, Twitter’s Storm, and Yahoo’s S4.

Big data and IBM InfoSphere Streams

InfoSphere Streams provides continuous analysis of massive data volumes. It is intended to perform complex analytics on heterogeneous data types, including text, images, audio, voice, VoIP, video, web traffic, e-mail, GPS data, financial transaction data, satellite data, and sensor data. It can perform real-time and look-ahead analysis of regularly generated data, using digital filtering, pattern/correlation analysis, and decomposition, as well as geospatial analysis.

Big data and Twitter’s Storm

Twitter’s Storm is an open source real-time analytics engine developed by a company called BackType, which Twitter acquired in 2011, in part because Twitter uses Storm internally. Storm is still available as open source and has been gaining significant traction among emerging companies.

Storm can be used with any programming language for applications such as real-time analytics, continuous computation, distributed remote procedure calls (RPCs), and integration. It is designed to work with existing queuing and database technologies. Companies using Storm in their big data implementations include Groupon, RocketFuel, NaviSite, and Ooyala.

Big data and Apache S4

The four S’s in S4 stand for Simple Scalable Streaming System. Apache S4 was developed by Yahoo! as a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous streams of data. The core platform is written in Java and was released by Yahoo! in 2010.

A year later, it was turned over to the Apache Software Foundation under the Apache 2.0 license. Clients that send and receive events can be written in any programming language. S4 is designed as a highly distributed system, and throughput can be increased linearly by adding nodes to a cluster. The S4 design is best suited for large-scale data mining and machine learning applications in a production environment.
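One common way a distributed stream processor lets throughput grow with node count is to route each event to a node based on a key, so that every added node takes a share of the keys. The sketch below illustrates that general idea only. It is not S4's API (S4 itself is written in Java), and the function names, the CRC32 hash, and the sample keys are all assumptions.

```python
import zlib
from collections import Counter


def node_for(key: str, node_count: int) -> int:
    """Pick a node for an event key with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(keys, node_count: int) -> Counter:
    """Count how many event keys land on each node."""
    return Counter(node_for(k, node_count) for k in keys)


# Hypothetical event keys, for example one per user.
keys = [f"user-{i}" for i in range(10_000)]

load_4 = distribute(keys, 4)  # per-node load on a 4-node cluster
load_8 = distribute(keys, 8)  # per-node load after doubling the cluster

print("busiest node with 4 nodes:", max(load_4.values()))
print("busiest node with 8 nodes:", max(load_8.values()))
```

With reasonably uniform keys, doubling the nodes roughly halves the work on the busiest node, which is the intuition behind near-linear scaling. A skewed key distribution would weaken that effect.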