Big Data: The Need for Metadata in Data Streams

Judith S. Hurwitz

Alan Nugent

Fern Halper

Marcia Kaufman

Updated

2016-03-26 15:03:06

From the book

Big Data For Dummies

Most big data management professionals are familiar with the need to manage metadata in structured database management environments. These data sources are strongly typed (for example, the first ten characters are the first name) and designed to operate with metadata. You might assume that metadata is nonexistent in unstructured data, but that is not true.

Typically, you find structure in any kind of data. Take the example of video. Although you might not know the exact content of a specific video, a lot of structure exists in the format of that video-based data. Similarly, if you are looking at unstructured text, you may know that the words are written in English and that, with the right tools, you can interpret the text.

Because of this implicit metadata in unstructured data, it is possible to parse the information using eXtensible Markup Language (XML). XML is a way of marking up unstructured text with meaningful tags. The underlying technology is not new and was one of the foundational technologies for implementing service orientation.
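As a rough illustration of tagging unstructured text with metadata, the following Python sketch wraps a piece of free text in XML using the standard library. The element and attribute names (`document`, `body`, `language`, `source`, `wordCount`) and the sample sentence are hypothetical choices made for this example, not part of any standard schema.

```python
import xml.etree.ElementTree as ET


def tag_text(text, language, source):
    """Wrap free text in an XML element that carries descriptive metadata."""
    # Hypothetical schema: a <document> element with metadata attributes.
    doc = ET.Element("document", attrib={"language": language, "source": source})
    body = ET.SubElement(doc, "body")
    body.text = text
    # Derived metadata: a simple word count computed from the text itself.
    doc.set("wordCount", str(len(text.split())))
    return ET.tostring(doc, encoding="unicode")


xml_record = tag_text("Sensor 12 reported a fault at noon.", "en", "maintenance-log")

# A downstream consumer can read the metadata without interpreting the text.
parsed = ET.fromstring(xml_record)
print(parsed.get("language"), parsed.get("wordCount"), parsed.find("body").text)
```

Once the text is tagged this way, a consumer can filter or route records by attributes such as language or source before attempting any deeper interpretation of the content.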

Examples of products for streaming data include IBM’s InfoSphere Streams, Twitter’s Storm, and Yahoo’s S4.

Big data and IBM InfoSphere Streams

InfoSphere Streams provides continuous analysis of massive data volumes. It is intended to perform complex analytics on a wide range of heterogeneous data types, including text, images, audio, voice, VoIP, video, web traffic, e-mail, GPS data, financial transaction data, satellite data, and sensor data. It can perform real-time and look-ahead analysis of regularly generated data, using digital filtering, pattern/correlation analysis, and decomposition as well as geospatial analysis.
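To make the idea of digital filtering on a stream more concrete, here is a minimal Python sketch of a moving-average filter applied to readings as they arrive. It is a generic illustration only and does not use InfoSphere Streams or its programming model; the function name, window size, and sample readings are assumptions made for this example.

```python
from collections import deque


def moving_average(stream, window=3):
    """Yield the mean of the most recent `window` readings for each new reading."""
    # A bounded deque drops the oldest reading automatically once it is full.
    recent = deque(maxlen=window)
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical sensor readings with one spike (50.0).
readings = [10.0, 12.0, 11.0, 50.0, 12.0]
smoothed = list(moving_average(readings, window=3))
print(smoothed)
```

Because the filter is a generator, it processes one reading at a time and keeps only the last few values in memory, which is the property that matters when the input never ends.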

Big data and Twitter’s Storm

Twitter’s Storm is an open source real-time analytics engine developed by BackType, a company that Twitter acquired in 2011. Twitter uses Storm internally. Storm is still available as open source and has been gaining significant traction among emerging companies.

Storm can be used with any programming language for applications such as real-time analytics, continuous computation, distributed remote procedure calls (RPCs), and integration. It is designed to work with existing queuing and database technologies. Companies using Storm in their big data implementations include Groupon, RocketFuel, Navisite, and Ooyala.
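Continuous computation can be pictured as a chain of small processing steps, each consuming events from the previous one. The Python sketch below uses plain generators to keep a running word count over incoming sentences. It does not use Storm's API; the function names and sample events are hypothetical, chosen only to show the shape of such a pipeline.

```python
from collections import Counter


def sentence_source(lines):
    """Emit sentences one at a time, standing in for a live feed."""
    for line in lines:
        yield line


def split_words(sentences):
    """Break each sentence into lowercase words."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def running_counts(words):
    """Emit (word, count so far) every time a word arrives."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Hypothetical input events.
events = ["storm processes streams", "streams never end"]
updates = list(running_counts(split_words(sentence_source(events))))
print(updates)
```

Each step sees one event at a time and emits its result immediately, so the counts stay current as new sentences arrive rather than being computed in a batch afterward.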

Big data and Apache S4

The four S’s in S4 stand for Simple Scalable Streaming System. Apache S4 was developed by Yahoo! as a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous streams of data. The core platform is written in Java and was released by Yahoo! in 2010.

A year later, it was turned over to the Apache Software Foundation under the Apache 2.0 license. Clients that send and receive events can be written in any programming language. S4 is designed as a highly distributed system. Throughput can be increased linearly by adding nodes to a cluster. The S4 design is best suited for large-scale applications for data mining and machine learning in a production environment.
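One common way for a distributed stream system to scale by adding nodes is to partition events by key, so that each node handles a share of the keys. The Python sketch below shows that general idea with a stable hash. It is an assumption-laden illustration, not S4's actual routing code; the function names, keys, and payloads are hypothetical.

```python
import zlib
from collections import defaultdict


def node_for(key, node_count):
    """Map a key to a node index using a hash that is stable across runs."""
    # zlib.crc32 is used instead of hash() because Python randomizes str hashes.
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(events, node_count):
    """Group (key, payload) events by the node responsible for each key."""
    nodes = defaultdict(list)
    for key, payload in events:
        nodes[node_for(key, node_count)].append((key, payload))
    return nodes


# Hypothetical click events keyed by user.
events = [("user-1", "click"), ("user-2", "view"), ("user-1", "buy"), ("user-3", "click")]
assignment = distribute(events, node_count=4)
for node_id, assigned in sorted(assignment.items()):
    print(node_id, assigned)
```

All events for a given key land on the same node, so per-key state stays local, and raising `node_count` spreads the keys over more machines.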

About This Article

About the book author:

Judith Hurwitz is an expert in cloud computing, information management, and business strategy.

Alan Nugent has extensive experience in cloud-based big data solutions.

Dr. Fern Halper specializes in big data and analytics.

Marcia Kaufman specializes in cloud infrastructure, information management, and analytics.

This article can be found in the category:

Big Data
