Structured Data Storage and Processing in Hadoop - dummies


By Dirk deRoos

When considering Hadoop’s capabilities for working with structured data (or working with data of any type, for that matter), remember Hadoop’s core characteristics: Hadoop is, first and foremost, a general-purpose data storage and processing platform designed to scale out to thousands of compute nodes and petabytes of data.

There’s no data model in Hadoop itself; data is simply stored on the Hadoop cluster as raw files. As such, the core components of Hadoop itself have no special capabilities for cataloging, indexing, or querying structured data.

The beauty of a general-purpose data storage system is that it can be extended for highly specific purposes. The Hadoop community has done just that with a number of Apache projects — projects that, in totality, make up the Hadoop ecosystem. When it comes to structured data storage and processing, the projects described in this list are the most commonly used:

  • Hive: A data warehousing framework for Hadoop. Hive catalogs data in structured files and provides a query interface with the SQL-like language named HiveQL.

  • HBase: A distributed NoSQL database — one that spreads its data and workload across many machines rather than relying on a single server — built on top of Hadoop.

  • Giraph: A graph processing engine for data stored in Hadoop.

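To illustrate how Hive layers structure over raw files, here is a minimal HiveQL sketch (the table name, columns, and HDFS path are hypothetical, not from the article). It catalogs a directory of comma-delimited files already sitting in HDFS and then queries it; the schema is applied on read, so the underlying files are never converted:

```sql
-- Hypothetical example: overlay a schema on raw CSV files already in HDFS.
-- Hive records only metadata in its catalog; the files themselves are untouched.
CREATE EXTERNAL TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  page_url   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';

-- Query with HiveQL; Hive compiles this into jobs that run on the cluster.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;
```

Because the table is declared EXTERNAL, dropping it removes only the catalog entry, not the data — a common pattern when Hive shares files with other Hadoop tools.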
Many other Apache projects support different aspects of structured data analysis, and some provide additional frameworks and interfaces on top of these systems.

When determining the optimal architecture for your analytics needs, be sure to evaluate the attributes and capabilities of the systems you’re considering. The table below compares Hadoop-based data stores (Hive, Giraph, and HBase) with a traditional RDBMS.

A Comparison of Hadoop-Based Storage and RDBMS
Changeable data
  • Hive: No
  • HBase: Yes
  • RDBMS: Yes

Data layout
  • Hive: Raw files stored in HDFS; Hive supports proprietary row-oriented or column-oriented formats.
  • HBase: A sparse, distributed, persistent, multidimensional sorted map.
  • RDBMS: Row-oriented or column-oriented.

Data types
  • Hadoop-based stores: Bytes; data types are interpreted on query.
  • RDBMS: Rich data type support.

Hardware
  • Hadoop-based stores: Hadoop-clustered commodity x86 servers; five or more are typical, because the underlying storage technology is HDFS, which by default requires three replicas.
  • RDBMS: Typically large, scalable multiprocessor systems.

High availability
  • Hadoop-based stores: Yes; built into the Hadoop architecture.
  • RDBMS: Yes, if the hardware and RDBMS are configured correctly.

Indexes
  • Hive: Yes
  • Giraph: No
  • HBase: Row-key only, or a special table is required.
  • RDBMS: Yes

Query language
  • Hive: HiveQL
  • Giraph: Giraph API
  • HBase: HBase API commands (such as get, put, and scan), plus HiveQL
  • RDBMS: SQL

Schema
  • Hive: Schema defined as files are catalogued with the Hive Data Definition Language (DDL); schema on read.
  • HBase: Schema can vary between rows.
  • RDBMS: Schema on load.

Throughput
  • HBase: Millions of reads and writes per second.
  • RDBMS: Thousands of reads and writes per second.

Transactions
  • Hive: None.
  • HBase: Provides ACID support on only a single row.
  • RDBMS: Provides multi-row and cross-table transactional support with full ACID property compliance.

Transaction speed
  • Hive: Modest speed for interactive queries; fast for full table scans.
  • HBase: Fast for interactive queries; fast for full table scans.
  • RDBMS: Fast for interactive queries; slower for full table scans.

Typical size
  • Hadoop-based stores: Ranges from terabytes to petabytes (from hundreds of millions to billions of rows).
  • RDBMS: From gigabytes to terabytes (from hundreds of thousands to millions of rows).