Hadoop as a Queryable Archive of Cold Warehouse Data

By Dirk deRoos

Numerous studies show that most data in an enterprise data warehouse is rarely queried. Database vendors have responded to such observations by implementing their own methods for deciding where different classes of data should be stored.

One method orders the data universe into designations of hot, warm, and cold, where hot data (sometimes called active data) is used often, warm data is used from time to time, and cold data is rarely used. The solution proposed by many vendors is to store the cold data on slower disks within the data warehouse enclosures, or to create clever caching strategies that keep the hot data in memory.
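A tiering policy like this is often driven by how recently data was accessed. The sketch below illustrates the idea with a simple age-based classifier; the thresholds and function names are hypothetical, chosen purely for illustration, and real vendor policies are more sophisticated.

```python
from datetime import datetime, timedelta

# Illustrative thresholds (hypothetical; real policies are vendor-specific)
HOT_WINDOW = timedelta(days=30)    # accessed within the last month
WARM_WINDOW = timedelta(days=180)  # accessed within the last six months

def classify(last_accessed: datetime, now: datetime) -> str:
    """Classify data as hot, warm, or cold by time since last access."""
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"

now = datetime(2014, 1, 1)
print(classify(datetime(2013, 12, 20), now))  # hot
print(classify(datetime(2013, 9, 1), now))    # warm
print(classify(datetime(2012, 1, 1), now))    # cold
```

Under such a policy, anything that falls into the "cold" bucket is a candidate for cheaper storage.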

The problem with this approach is that even though slower storage is used, it's still expensive to store cold, seldom-used data in a warehouse. The costs here stem from both hardware and software licensing. At the same time, cold and dormant data is often archived to tape.

This traditional model of archiving data breaks down when you want to query all cold data in a cost-effective and relatively efficient way — without having to request old tapes, in other words.

If you look at the cost and operational characteristics of Hadoop, it seems set to become the new backup tape. Hadoop is inexpensive largely because Hadoop systems are designed to use a lower grade of hardware than what's normally deployed in data warehouse systems. Another significant cost savings is software licensing.

Commercial Hadoop distribution licenses cost a fraction of what relational data warehouse software licenses do, and the latter are notorious for being expensive. From an operational perspective, Hadoop is designed to scale easily just by adding slave nodes to an existing cluster. And as slave nodes are added and data sets grow in volume, Hadoop's data processing frameworks enable your applications to seamlessly handle the increased workload.

Hadoop represents a simple, flexible, and inexpensive way to push processing across literally thousands of servers.

With its scalable and inexpensive architecture, Hadoop would seem to be a perfect choice for archiving warehouse data . . . except for one small matter: Most of the IT world runs on SQL, and SQL on its own doesn’t play well with Hadoop.

Sure, the more Hadoop-friendly NoSQL movement is alive and well, but most power users now use SQL by way of common, off-the-shelf toolsets that generate SQL queries under the hood — products such as Tableau, Microsoft Excel, and IBM Cognos BI.

It’s true that the Hadoop ecosystem includes Hive, but Hive supports only a subset of SQL, and although performance is improving (along with SQL support), it’s not nearly as fast at answering smaller queries as relational systems are. Recently, there has been major progress around SQL access to Hadoop, which has paved the way for Hadoop to become the new destination for online data warehouse archives.

Depending on the Hadoop vendor, SQL (or SQL-like) APIs are becoming available so that the more common off-the-shelf reporting and analytics tools can seamlessly issue SQL that executes on data stored in Hadoop. For example, IBM has its Big SQL API, Cloudera has Impala, and Hive itself, via the Hortonworks Stinger initiative, is becoming increasingly SQL-compliant.
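To make the idea concrete, here is a minimal HiveQL sketch of how archived warehouse data sitting in HDFS might be exposed to ordinary SQL tools. The table name, columns, and HDFS path are hypothetical, invented for illustration; the pattern — an external table layered over archive files, then queried with plain SQL — is the point.

```sql
-- Hypothetical example: expose archived warehouse data in HDFS to SQL tools.
-- Table, column, and path names are illustrative only.
CREATE EXTERNAL TABLE sales_archive (
  order_id     BIGINT,
  customer_id  BIGINT,
  order_total  DECIMAL(10,2),
  order_date   DATE
)
PARTITIONED BY (order_year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/warehouse/archive/sales';

-- Off-the-shelf BI tools can then issue ordinary SQL against the archive:
SELECT order_year, SUM(order_total) AS revenue
FROM sales_archive
WHERE order_year < 2010
GROUP BY order_year;
```

Because the table is external, Hive reads the archive files in place; no data has to be reloaded into the warehouse just to answer the occasional question about old records.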

Though various points of view exist (some aim to enhance Hive; some, to extend Hive; and others, to provide an alternative), all these solutions attempt to tackle two issues: MapReduce is a poor solution for executing smaller queries, and SQL access is — for now — the key to enabling IT workers to use their existing SQL skills to get value out of data stored in Hadoop.