Cloudera Impala and Hadoop

By Dirk deRoos

Cloudera is a leading Apache Hadoop software and services provider in the big data market. Like Apache Drill, Cloudera’s Impala technology seeks to improve interactive query response time for Hadoop users. Apache Hive has provided a familiar and powerful query mechanism for Hadoop users, but query response times are often unacceptable due to Hive’s reliance on MapReduce. Cloudera’s answer to this problem is Impala.

Cloudera has developed a massively parallel processing (MPP) query engine, written in C++, to replace the MapReduce layer leveraged by Apache Hive. Taking a different path from Drill, which is written in Java, Cloudera decided that a native C++ MPP engine was the answer for fast, interactive Hadoop queries.

Note that Impala uses HiveQL as a programming interface, and Impala’s Query Exec Engines are co-located with HDFS data nodes, in keeping with the Hadoop approach of co-locating data with processing tasks. Impala can also use HBase as a data store. In this sense, Impala is an extension to Apache Hadoop, providing a very high-performance alternative to the Hive-on-top-of-MapReduce model.
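The co-location idea can be illustrated with a small, self-contained Python sketch. It is not Impala's scheduler: the block map, the node names, and the functions `assign_scans` and `distributed_count` are hypothetical and exist only for illustration. The sketch assigns each block scan to a node that already holds a replica of that block, lets each node compute a partial result over its local blocks, and then merges the partial results, which is the general scatter-and-gather shape of an MPP query.

```python
from collections import defaultdict

# Hypothetical block map: block id -> data nodes that hold a replica.
BLOCK_REPLICAS = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-b", "node-c"],
    "blk_3": ["node-a", "node-c"],
    "blk_4": ["node-c", "node-a"],
}

# Hypothetical row counts per block, standing in for real table data.
ROWS_PER_BLOCK = {"blk_1": 100, "blk_2": 250, "blk_3": 75, "blk_4": 125}


def assign_scans(block_replicas):
    """Assign each block scan to the least-loaded node holding a replica."""
    load = defaultdict(int)
    plan = {}
    for block in sorted(block_replicas):
        replicas = block_replicas[block]
        # min() keeps the first minimal element, so ties go to the
        # replica listed first.
        node = min(replicas, key=lambda n: load[n])
        plan[block] = node
        load[node] += 1
    return plan


def distributed_count(plan, rows_per_block):
    """Each node counts rows in its local blocks; a coordinator merges."""
    partial_counts = defaultdict(int)
    for block, node in plan.items():
        partial_counts[node] += rows_per_block[block]
    return sum(partial_counts.values()), dict(partial_counts)


if __name__ == "__main__":
    plan = assign_scans(BLOCK_REPLICAS)
    total, per_node = distributed_count(plan, ROWS_PER_BLOCK)
    print(plan)
    print(per_node, total)
```

Because every scan runs on a node that stores the block, no block has to cross the network before it is read; only the small partial results travel to the coordinator.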

Cloudera and Twitter led the development of Parquet, a new Hadoop file format that can be used with Impala and is available as open source on GitHub. The Parquet file format provides a robust columnar medium for storing data in Hadoop. It supports highly efficient compression and encoding, and is effective for storing nested data structures.
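To see why a columnar layout compresses well, consider the following simplified Python sketch. It does not read or write Parquet files and does not reproduce Parquet's actual on-disk format; the sample rows and the helper names `to_columns`, `run_length_encode`, and `run_length_decode` are assumptions made for illustration. It pivots row records into columns and then applies run-length encoding, one of the simple encodings that works well when similar values sit next to each other.

```python
from itertools import groupby

# Hypothetical sample records, stored row by row.
ROWS = [
    {"user": "ann", "country": "US", "clicks": 3},
    {"user": "bob", "country": "US", "clicks": 5},
    {"user": "cho", "country": "US", "clicks": 2},
    {"user": "dee", "country": "DE", "clicks": 7},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


if __name__ == "__main__":
    columns = to_columns(ROWS)
    encoded = run_length_encode(columns["country"])
    print(encoded)
    print(run_length_decode(encoded) == columns["country"])
```

A query that needs only the `clicks` column can read that one list and skip the others, which is the other main benefit of storing data by column rather than by row.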

Like Apache Drill, Cloudera’s Impala technology was inspired by Google’s Dremel invention.