Massively Parallel Processing Databases

By Dirk deRoos

To provide a better understanding of the SQL-on-Hadoop alternatives to Hive, it might be helpful to review a primer on massively parallel processing (MPP) databases first.

Apache Hive is layered on top of the Hadoop Distributed File System (HDFS) and the MapReduce system and presents an SQL-like programming interface to your data (HiveQL, to be precise). This combination of Hadoop technologies deployed on a cluster is similar to MPP databases that have existed for a while in the IT marketplace.

MPP databases usually provide an SQL interface and a relational database management system (RDBMS) running on a cluster of servers networked together by a high-speed interconnect. The figure shows the components of an RDBMS that are typically included in SQL-on-Hadoop solutions.


Relational data systems have evolved considerably to a point where best practices have emerged among most offerings in terms of an optimal query execution infrastructure. The figure shows this in terms of the flow of a query as it’s processed by an RDBMS engine.

First, the query text is parsed and understood. Then the syntax tree for the query is compiled into a logical execution plan, which is then optimized to form the final physical execution plan, which is then executed by the runtime. For many of the SQL-on-Hadoop solutions, you’re seeing similar components being deployed in Hadoop.

MPP clusters are usually referred to as having a Shared-Nothing architecture, because each system has its own CPU, memory and disk. However, through the database software and high-speed interconnects, the system functions as a whole and can scale as new servers are added to the cluster. The overall system is explicitly tuned to provide fast, interactive query response.

MPP databases are often more flexible, scalable, and cost effective than the traditional RDBMS, hosted on a large multiprocessor server.