
Hadoop as an Archival Data Destination

By Dirk deRoos

The low cost of Hadoop storage, plus the ability to query Hadoop data with SQL, makes Hadoop a prime destination for archival data. This use case has a low impact on your organization because you can start building your Hadoop skill set on data that isn’t stored on performance-mission-critical systems.

What’s more, you don’t have to work hard to get at the data. (Because archived data is normally stored on low-usage systems, it’s easier to get at than data that’s in “the limelight” on performance-mission-critical systems, like data warehouses.) If you’re already using Hadoop as a landing zone, you have the foundation for your archive! You simply keep what you want to archive and delete what you don’t.

If you think about Hadoop’s landing zone, the queryable archive, shown in the figure, extends the value of Hadoop and starts to integrate pieces that likely already exist in your enterprise. It’s a great example of finding economies of scale and cost take-out opportunities using Hadoop.


Here, the archive component connects the landing zone and the data warehouse. The data being archived originates in the warehouse and is then stored in the Hadoop cluster, which is also provisioning the landing zone. In short, you can use the same Hadoop cluster to archive data and act as your landing zone.

The key Hadoop technology you would use to perform the archiving is Sqoop, which can move the data to be archived from the data warehouse into Hadoop. You will need to consider what form you want the data to take in your Hadoop cluster. In general, compressed Hive files are a good choice.
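As a sketch, a Sqoop import of a warehouse table into a compressed Hive table might look like the following. The JDBC URL, credentials, and table names are hypothetical placeholders — substitute your own warehouse details:

```shell
# Import a warehouse table into Hadoop as a compressed Hive table.
# The connection string, user, and table names below are placeholders.
sqoop import \
  --connect jdbc:db2://warehouse-host:50000/SALESDB \
  --username archive_user -P \
  --table SALES_HISTORY \
  --hive-import \
  --hive-table archive.sales_history \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec
```

Here, `--hive-import` creates and loads the corresponding Hive table, and `--compress` (with the chosen codec) compresses the files Sqoop writes into HDFS.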

You can, of course, transform the data from the warehouse structures into some other form (for example, a normalized form to reduce redundancy), but this is generally not a good idea. Keeping the data in the same structure as what’s in the warehouse will make it much easier to perform a full data set query across the archived data in Hadoop and the active data that’s in the warehouse.

The concept of querying both the active and archived data sets brings up another consideration: how much data should you archive? There are really two common choices: archive everything as data is added and changed in the data warehouse, or only archive the data you deem to be cold.

Archiving everything has the benefit of enabling you to issue queries across the entire data set from a single interface. Without a full archive, you need a federated query solution, in which you union the results from the archive with those from the active data warehouse.
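For example, if the archived and active tables share the same schema, the federated union could be sketched in SQL like this (the table and column names are hypothetical, and the federation mechanism depends on your SQL-on-Hadoop tooling):

```sql
-- Cold rows live in the Hadoop archive; hot rows stay in the warehouse.
SELECT order_id, order_date, amount
FROM   archive.sales_history            -- archived data in Hadoop
WHERE  order_date <  DATE '2013-01-01'
UNION ALL
SELECT order_id, order_date, amount
FROM   warehouse.sales                  -- active data in the warehouse
WHERE  order_date >= DATE '2013-01-01';
```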

But the downside here is that regular updates of your data warehouse’s hot data would cause headaches for the Hadoop-based archive. This is because any changes to data in individual rows and columns would require wholesale deletion and re-cataloging of existing data sets.

Now that archival data is stored in your Hadoop-based landing zone (assuming you’re using an option like the compressed Hive files mentioned previously), you can query it. This is where the SQL on Hadoop solutions can become interesting.
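As an illustration, an archive table stored as compressed ORC files remains fully queryable with ordinary HiveQL. The table definition below is a hypothetical sketch, not a prescribed layout:

```sql
-- Hypothetical archive table, stored as ORC files with Snappy compression.
CREATE TABLE archive.sales_history (
  order_id   BIGINT,
  order_date DATE,
  amount     DECIMAL(10,2)
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Ordinary SQL works against the compressed archive.
SELECT order_date, SUM(amount)
FROM   archive.sales_history
GROUP BY order_date;
```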

An excellent example of what’s possible is for the analysis tools (on the right in the figure) to directly run reports or analysis on the archived data stored in Hadoop. This is not to replace the data warehouse — after all, Hadoop would not be able to match the warehouse’s performance characteristics for supporting hundreds or more concurrent users asking complex questions.

The point here is that you can use reporting tools against Hadoop to experiment and come up with new questions to answer in a dedicated warehouse or mart.

When you start your first Hadoop-based project for archiving warehouse data, don’t break the current processes until you’ve fully tested them on your new Hadoop solution. In other words, if your current warehousing strategy is to archive to tape, keep that process in place, and dual-archive the data into Hadoop and tape until you’ve fully tested the scenario (which would typically include restoring the warehouse data in case of a warehouse failure).

Though you’re maintaining (in the short term) two archive repositories, you’ll have a robust infrastructure in place and tested before you decommission a tried-and-true process. This process can ensure that you remain employed — with your current employer.

This use case is simple because there’s no change to the existing warehouse. The business goal is still the same: lower storage and licensing costs by migrating rarely used data to an archive. The difference in this case is that the technology behind the archive is Hadoop rather than offline storage, like tape.

In addition, various archive vendors have started to incorporate Hadoop into their solutions (for example, allowing their proprietary archive files to reside on HDFS), so expect capabilities in this area to expand soon.

As you develop Hadoop skills (like exchanging data between Hadoop and relational databases and querying data in HDFS), you can use them to tackle bigger problems, such as analysis projects, which could provide additional value for your organization’s Hadoop investment.