Hadoop as a Data Preprocessing Engine

By Dirk deRoos

One of the earliest use cases for Hadoop in the enterprise was as a programmatic transformation engine used to preprocess data bound for a data warehouse. Essentially, this use case leverages the Hadoop ecosystem to manipulate and transform data before it’s loaded into a data warehouse.

Though the actual transformation engine is new (it’s Hadoop, so transformations and data flows are coded in Pig or MapReduce, among other languages), the approach itself has been in use for a while in the form of Extract, Transform, Load (ETL) processes.

Think about the evolution of online transaction processing (OLTP) and relational OLAP (ROLAP) databases. Many organizations with operational databases also deployed data warehouses. So how do IT departments get data from their operational databases into their data warehouses? (Remember that the operational data is typically not in a form that lends itself to analysis.)

The answer here is ETL, and as data warehouses increased in use and importance, the steps in the process became well understood and best practices were developed. Also, a number of software companies started offering interesting ETL solutions so that IT departments could minimize their own custom code development.

The basic ETL process is fairly straightforward: you Extract data from an operational database, Transform it into the form you need for your analysis and reporting tools, and then you Load this data into your data warehouse.

One common variation on ETL is ELT (Extract, Load, and Transform). In the ELT process, in contrast to ETL, you perform the transformations after loading the data into the target repository. This approach is often used when the transformation stands to benefit greatly from a very fast SQL processing engine on structured data. (Relational databases may not excel at processing unstructured data, but they are very fast at processing, you guessed it, structured data.)

If the data you’re transforming is destined for a data warehouse, and many of those transformations can be done in SQL, you may choose to run the transformations in the data warehouse itself. ELT is especially appealing if the bulk of your skill set lies with SQL-based tooling.
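As a rough sketch of what that looks like in practice (the table and column names here are hypothetical), an ELT-style job might bulk-load raw rows into a staging table unchanged and then reshape them with plain SQL inside the warehouse:

```sql
-- Hypothetical ELT example: raw order rows were already bulk-loaded,
-- unchanged, into staging_orders; the transform runs in the warehouse itself.
INSERT INTO sales_fact (order_date, customer_id, product_id, revenue)
SELECT
    order_date,
    customer_id,
    product_id,
    quantity * unit_price              -- derive the reporting measure
FROM staging_orders
WHERE status = 'COMPLETE';             -- drop rows the reports don't need
```

Because the heavy lifting is a single set-based SQL statement, it runs on the warehouse’s own engine, which is exactly the strength that ELT is designed to exploit.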

With Hadoop now able to process SQL queries, both ETL and ELT workloads can be hosted on Hadoop. The figure shows ETL services added to the reference architecture.

[Figure: ETL services added to the reference architecture.]

If you’ve deployed a Hadoop-based landing zone, you’ve got almost everything you need in place to use Hadoop as a transformation engine. You’re already landing data from your operational systems into Hadoop using Sqoop, which covers the extraction step. At this point you need to implement your transformation logic in MapReduce or Pig applications. After the data is transformed, you can load it into the data warehouse, again using Sqoop.
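The transformation logic itself would normally be written in Pig or MapReduce, as noted above; because Hadoop can also process SQL queries, a SQL-on-Hadoop engine such as Hive is another option. Here’s a minimal sketch of that middle step, using hypothetical table names, with the Sqoop import happening before it and the Sqoop export after it:

```sql
-- Hypothetical Hive-style transform over order data that Sqoop has already
-- imported into the landing zone; the resulting table is what Sqoop later
-- exports to the data warehouse.
CREATE TABLE orders_clean AS
SELECT
    o.order_id,
    o.customer_id,
    c.region,                          -- conform a lookup value during the transform
    o.quantity * o.unit_price AS revenue
FROM orders_raw o
JOIN customers_raw c ON (o.customer_id = c.customer_id)
WHERE o.status = 'COMPLETE';           -- keep only the records the warehouse needs
```

The same logic could just as easily be expressed as a Pig data flow or a MapReduce job; the point is that the transformation runs on Hadoop rather than in the warehouse.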

Using Hadoop as a data transformation engine opens up another possibility as well. If your data warehouse doesn’t modify its data (it’s used for reporting only), you can simply keep the transformed data you generate in the landing zone. In this model, data flows only from left to right in the figure: it’s extracted from operational databases, transformed in the landing zone, and then loaded into the data warehouse.

With all the transformed data already in the landing zone, there’s no need to copy it back to Hadoop — unless, of course, the data gets modified in the warehouse.