SQL Access and Apache Hive

By Dirk deRoos

Apache Hive is indisputably the most widespread data query interface in the Hadoop community. Originally, the design goals for Hive were not for full SQL compatibility and high performance, but were to provide an easy, somewhat familiar interface for developers needing to issue batch queries against Hadoop.

This rather piecemeal approach no longer works, so the demand grows for real SQL support and good performance. Hortonworks responded to this demand by creating the Stinger project, where it invested its developer resources in improving Hive to be faster, to scale at a petabyte level, and to be more compliant to SQL standards. This work was to be delivered in three phases.

In Phases 1 and 2, you saw a number of optimizations for how queries were processed as well as added support for traditional SQL data types; the addition of the ORCFile format for more efficient processing and storage; and integration with YARN for better performance.

In Phase 3, the truly significant evolutions take place, which decouple Hive from MapReduce. Specifically, it involves the release of Apache Tez, which is an alternative processing model for Hadoop, designed for interactive workloads.

In addition to the Stinger project, Hortonworks is spearheading an ambitious initiative to enable Hive to support editing data at the row level with full compliance with the ACID properties for database systems: Atomicity, Consistency, Isolation levels, and Durability.