Joining Tables with Hive
You probably know already that experts in relational database modeling and design typically spend a lot of their time designing normalized databases, or schemas. Database normalization is a technique that guards against data loss, redundancy, and other anomalies as data is updated and retrieved.
The experts follow a number of rules to arrive at a normalized database, but Rule 1 is that you must end up with a group of tables. (One large table storing all your data is not normal — pun intended.) There are exceptions, depending on the use case, but the law of many tables is generally followed closely, especially for databases that support transactions or analytic processing (business intelligence, for example).
When you begin to query and analyze your data, tables are joined based on the defined relationships between them using SQL — which means that the disks are ultimately busy on your server when you start joining tables, and busy disks usually result in slower user response times. However, the good news is that RDBMSs and EDWs are tuned to make joins as fast as possible.
What does all this have to do with joins in Hive? Well, remember that the underlying operating system for Hive is (surprise!) Apache Hadoop: MapReduce is the engine for joining tables, and the Hadoop File System (HDFS) is the underlying storage. It’s all good news for the user who wants to create, manage, and analyze large tables with Hive.
The potential to unlock information that’s hidden in massive data structures is exciting. However, joins with Hive usually don’t perform as well as they do in the RDBMS/EDW world, so first-time users are often surprised by the “pokiness” of the system response.
Remember that MapReduce and HDFS are optimized for throughput with big data analytics and that, in this world, latencies — user response times, in other words — are usually high. Hive is designed for batch-style analytic processing, not for fast online transaction processing. Users who want the best possible performance with SQL on Apache Hadoop have solutions available.
Keep this dynamic in mind when you start joining tables with Hive. Also note that Hive architects usually denormalize their databases to some extent, so having fewer larger tables is commonplace. That’s why complex data types such as STRUCTs and ARRAYs are provided. You can use these complex data types to pack a lot more data into a single table.
Because Hive table reads and writes via HDFS usually involve very large blocks of data, the more data you can manage altogether in one table, the better the overall performance.
Disk and network access is a lot slower than memory access, so minimize HDFS reads and writes as much as possible.
With this background information in mind, you can tackle making joins with Hive. Fortunately, the Hive development community was realistic and understood that users would want and need to join tables with HiveQL. This knowledge becomes especially important with EDW augmentation. Use cases such as “queryable” archives often require joins for data analysis.
Here is a Hive join example using flight data tables. The listing shows you how to create and display a myflightinfo2007 table and a myflightinfo2008 table from the larger FlightInfo2007 and FlightInfo2008 tables. The plan all along was to use the CTAS created myflightinfo2007 and myflightinfo2008 tables to illustrate how you can perform joins in Hive.
The figure shows the result of an inner join with the myflightinfo2007 and myflightinfo2008 tables using the SQuirreL SQL client.
Hive supports equi-joins, a specific type of join that only uses equality comparisons in the join predicate. (ON m8.FlightNum = m7.FlightNum is one example of an equi-join.) Other comparators such as Less Than (<) are not supported. This restriction is only because of limitations on the underlying MapReduce engine. Also, you cannot use OR in the ON clause.
The figure illustrates the earlier example of the inner join and two other Hive join types. Note that you can confirm the results of an inner join by reviewing the contents of the myflight2007 and myflight2008 tables.
The following figure illustrates how an inner join works using a Venn diagram, in case you’re not familiar with the technique. The basic idea here is that an inner join returns the records that match between two tables. So an inner join is a perfect analysis tool to determine which flights are the same from JFK (New York) to ORD (Chicago) in July of 2007 and July of 2008.
Optimizing Hive joins is a hot topic in the Hive community. For more information on current optimization techniques, see the Join Optimization page on the Hive wiki.