Data Transformation in Hadoop

By Dirk deRoos

The idea of Hadoop-inspired ETL engines has gained a lot of traction in recent years. After all, Hadoop is a flexible data storage and processing platform that can support huge amounts of data and operations on that data. At the same time, it’s fault tolerant, and it offers the opportunity for capital and software cost reductions.

Despite Hadoop’s popularity as an ETL engine, many folks (including a famous firm of analysts) don’t recommend Hadoop as the sole piece of technology in your ETL strategy. This is largely because developing ETL flows requires deep knowledge of your organization’s existing database systems, the nature of the data itself, and the reports and applications that depend on it.

In other words, the DBAs, developers, and architects in your IT department would need to become familiar enough with Hadoop to implement the needed ETL flows. For example, intensive hand coding in Pig, Hive, or even MapReduce may be necessary to create even the simplest of data flows (see the sketch below), which puts your company on the hook for those skills if it follows this path.
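To give a sense of what that hand coding looks like, here is a minimal sketch of a map-only MapReduce job in Java that filters and trims records from a comma-delimited extract. The class name, the column positions, and the COMPLETED status value are hypothetical and chosen only for illustration; a real flow would also need the checkpointing, error handling, and lineage support discussed below.

// Hypothetical example: a map-only MapReduce job that keeps completed
// orders from a CSV extract and emits a slimmed-down record.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesExtractJob {

  // Mapper: parse each CSV line, keep only completed orders, and emit
  // a trimmed record (order ID, region, amount). Malformed rows are
  // silently dropped; any smarter error handling is up to you.
  public static class SalesMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 4) {
        return; // skip malformed rows
      }
      if ("COMPLETED".equals(fields[3].trim())) {
        String slim = fields[0] + "\t" + fields[1] + "\t" + fields[2];
        context.write(NullWritable.get(), new Text(slim));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "sales extract");
    job.setJarByClass(SalesExtractJob.class);
    job.setMapperClass(SalesMapper.class);
    job.setNumReduceTasks(0); // map-only job: pure filter-and-project
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this trivial filter-and-project job takes dozens of lines of Java before you add restart logic, bad-record handling, or lineage tracking; a comparable Pig or Hive script is shorter, but the operational plumbing still falls to your team.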

You also have to hand-code elements such as parallel debugging and application management services such as checkpointing, error handling, and event handling. On top of that, consider enterprise requirements such as maintaining a business glossary and being able to show your data’s lineage.

Many industry-standard reports carry regulatory requirements for data lineage: the reporting organization must be able to show where each data point in a report came from, how it arrived, and what has been done to it along the way.

Even for relational database systems, ETL is complex enough that specialized products have emerged to provide interfaces for managing and developing ETL flows. Some of these products now support Hadoop-based ETL and other Hadoop-based development. Even so, depending on your requirements, you may need to write some of your own code to support your transformation logic.