The Hybrid Data Preprocess Option in Hadoop - dummies

The Hybrid Data Preprocess Option in Hadoop

By Dirk deRoos

In addition to having to store larger volumes of cold data, one pressure you see in traditional data warehouses is that increasing amounts of processing resources are being used for transformation (ELT) workloads.

The idea behind using Hadoop as a preprocessing engine to handle data transformation means that precious processing cycles are freed up, allowing the data warehouse to adhere to its original purpose: Answer repeated business questions to support analytic applications. Again, you’re seeing how Hadoop can complement traditional data warehouse deployments and enhance their productivity.

Perhaps a tiny, imaginary light bulb has lit up over your head and you’re thinking, “Hey, maybe there are some transformation tasks perfectly suited for Hadoop’s data processing ability, but I know there’s also a lot of transformation work steeped in algebraic, step-by-step tasks where running SQL on a relational database engine would be the better choice. Wouldn’t it be cool if I could run SQL on Hadoop?”

SQL on Hadoop is already here. With the ability to issue SQL queries against data in Hadoop, you’re not stuck with only an ETL approach to your data flows — you can also deploy ELT-like applications.

Another hybrid approach to consider is where to run your transformation logic: in Hadoop or in the data warehouse? Although some organizations are concerned about running anything but analytics in their warehouses, the fact remains that relational databases are excellent at running SQL, and could be a more practical place to run a transformation than Hadoop.