Structured Data Storage and Processing in Hadoop
When considering Hadoop’s capabilities for working with structured data (or working with data of any type, for that matter), remember its core characteristics: Hadoop is, first and foremost, a general-purpose data storage and processing platform designed to scale out to thousands of compute nodes and petabytes of data.
There’s no data model in Hadoop itself; data is simply stored on the Hadoop cluster as raw files. As such, the core components of Hadoop itself have no special capabilities for cataloging, indexing, or querying structured data.
The beauty of a general-purpose data storage system is that it can be extended for highly specific purposes. The Hadoop community has done just that with a number of Apache projects — projects that, in totality, make up the Hadoop ecosystem. When it comes to structured data storage and processing, the projects described in this list are the most commonly used:
Hive: A data warehousing framework for Hadoop. Hive catalogs data in structured files and provides a query interface through HiveQL, a SQL-like language.
HBase: A distributed database built on top of Hadoop; in other words, a NoSQL database that spreads its data across many machines rather than relying on a single server.
Giraph: A graph processing engine for data stored in Hadoop.
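HBase’s data model is often described as a sparse, distributed, persistent, multidimensional sorted map: every cell lives at a coordinate of (row key, column, timestamp), and the coordinates are kept in sorted order so that row lookups and range scans stay cheap. The following is a toy, in-memory Python sketch of that idea only; the class and method names are illustrative and are not the real HBase client API.

```python
from bisect import insort

class SortedMapStore:
    """Toy sketch of HBase's data model: a sparse, multidimensional
    map whose cells are kept sorted by (row key, column, timestamp).
    Purely illustrative; not the real HBase API."""

    def __init__(self):
        self._cells = {}   # (row, column, timestamp) -> raw bytes
        self._keys = []    # cell coordinates, kept in sorted order

    def put(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        if key not in self._cells:
            insort(self._keys, key)  # maintain sort order on insert
        self._cells[key] = value     # values are raw bytes, as in HBase

    def get(self, row):
        # Return every cell stored under one row key. Rows are sparse:
        # a row simply has fewer cells; no space is reserved per column.
        return {(c, ts): v for (r, c, ts), v in self._cells.items() if r == row}

    def scan(self, start_row, stop_row):
        # Range scans walk the sorted coordinates in order.
        for key in self._keys:
            if start_row <= key[0] < stop_row:
                yield key, self._cells[key]

store = SortedMapStore()
store.put("user#1", "name", 1, b"alice")
store.put("user#2", "name", 1, b"bob")
store.put("user#1", "email", 2, b"a@example.com")
# Scanning ["user#1", "user#2") yields only user#1's cells, in sorted order.
user1 = list(store.scan("user#1", "user#2"))
```

Because the map is sparse, two rows can hold entirely different sets of columns, which is what the “variability in schema between rows” attribute in the comparison below refers to.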
Many other Apache projects support different aspects of structured data analysis, and some provide specialized frameworks and interfaces of their own.
When determining the optimal architecture for your analytics needs, be sure to evaluate the attributes and capabilities of the systems you’re considering. The table below compares the Hadoop-based data stores (Hive, Giraph, and HBase) with a traditional RDBMS.
| Attribute | Hive | Giraph | HBase | RDBMS |
|---|---|---|---|---|
| Data layout | Raw files stored in HDFS; Hive supports proprietary row-oriented or column-oriented formats | Raw files stored in HDFS | A sparse, distributed, persistent, multidimensional sorted map | Row-oriented or column-oriented |
| Data types | Bytes; data types are interpreted on query | Bytes; data types are interpreted on query | Bytes; data types are interpreted on query | Rich data type support |
| Hardware | Hadoop-clustered commodity x86 servers; five or more is typical because HDFS, the underlying storage, keeps three replicas by default | Hadoop-clustered commodity x86 servers | Hadoop-clustered commodity x86 servers | Typically large, scalable multiprocessor systems |
| High availability | Yes; built into the Hadoop architecture | Yes; built into the Hadoop architecture | Yes; built into the Hadoop architecture | Yes, if the hardware and RDBMS are configured correctly |
| Indexes | Yes | No | Row-key only, or a special table is required | Yes |
| Query language | HiveQL | Giraph API | HBase API commands; HiveQL | SQL |
| Schema | Schema defined as files are catalogued with the Hive Data Definition Language (DDL) | Schema on read | Variability in schema between rows | Schema on load |
| Throughput | Millions of reads and writes per second | Millions of reads and writes per second | Millions of reads and writes per second | Thousands of reads and writes per second |
| Transactions | None | None | ACID support on a single row only | Multi-row and cross-table transactions with full ACID compliance |
| Transaction speed | Modest speed for interactive queries; fast for full table scans | Not applicable | Fast for interactive queries; fast for full table scans | Fast for interactive queries; slower for full table scans |
| Typical size | Terabytes to petabytes (hundreds of millions to billions of rows) | Terabytes to petabytes | Terabytes to petabytes | Gigabytes to terabytes (hundreds of thousands to millions of rows) |
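The “bytes; data types are interpreted on query” and schema rows above capture a key difference between the two worlds: an RDBMS validates and types data as it is loaded (schema on load), while the Hadoop-based stores keep raw bytes and apply a schema only when a query reads them (schema on read). The following Python sketch illustrates the schema-on-read idea; the field names, types, and sample data are invented for illustration, and this is a simplification of how Hive actually catalogs files with its DDL.

```python
# Schema on read, sketched: the "files" hold only raw bytes.
# Nothing was validated or typed when this data was loaded.
raw_rows = [b"1001,alice,2013-07-01", b"1002,bob,2013-08-15"]

# A schema declared at query time (hypothetical field names/types)
# decides how the bytes are interpreted -- much as Hive's DDL
# catalogs raw files and HiveQL queries them.
schema = [("id", int), ("name", str), ("signup_date", str)]

def read_with_schema(raw, schema):
    # Decode and split the raw bytes, then cast each field
    # according to the schema, only now, at read time.
    fields = raw.decode("utf-8").split(",")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = [read_with_schema(r, schema) for r in raw_rows]
```

Under schema on load, by contrast, a malformed row would be rejected at insert time; under schema on read, it sits harmlessly in the file until a query tries to interpret it.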