Improving Your Hive Queries with Indexes - dummies

Improving Your Hive Queries with Indexes

By Dirk deRoos

Creating an index is common practice with relational databases when you want to speed access to a column or set of columns in your database. Without an index, the database system has to read all rows in the table to find the data you have selected. Indexes become even more essential when the tables grow extremely large, and as you now undoubtedly know, Hive thrives on large tables.

As you would expect, Hive supports index creation on tables, though its functionality is still somewhat immature. However, the Hive community is active, and indexing will eventually mature. Even with its current limitations, indexing offers an approach to speed up Hive queries with little effort.

You can optimize Hive queries in at least five ways: First, with a little research, you can often speed your joins by leveraging certain optimization techniques, as described on the Hive wiki. Second, column-oriented storage options can be quite helpful. Remember that the ORC file format is new as of Hive 0.11.

Third, you can partition tables. Fourth, the Hive community has provided indexing. Finally, don’t forget the configuration variable.

In the following are the steps necessary to index the FlightInfo2008 table. This extremely large table has millions of rows, so it makes a good candidate for an index or two.

(B) ALTER INDEX f08_index ON flightinfo2008 REBUILD;
(C) hive (flightdata)> SHOW INDEXES ON FlightInfo2008;
f08index                flightinfo2008          origin                  flightdata__flightinfo2008_f08index__ compact
Time taken: 0.079 seconds, Fetched: 1 row(s)
(D) hive (flightdata)> DESCRIBE flightdata__flightinfo2008_f08index__;
origin                  string                  None
_bucketname             string
_offsets                array<bigint>
Time taken: 0.112 seconds, Fetched: 3 row(s)
(E) hive (flightdata)> SELECT Origin, COUNT(1) FROM flightinfo2008 WHERE Origin = 'SYR' GROUP BY Origin;
SYR     12032
Time taken: 17.34 seconds, Fetched: 1 row(s)
(F) hive (flightdata)> SELECT Origin, SIZE(`_offsets`) FROM flightdata__flightinfo2008_f08index__ WHERE origin = 'SYR';
SYR     12032
Time taken: 8.347 seconds, Fetched: 1 row(s)
(G) hive (flightdata)> DESCRIBE flightdata__flightinfo2008_f08index__;
origin                  string                  None
_bucketname             string
_offsets                array<bigint>
Time taken: 0.12 seconds, Fetched: 3 row(s)

Step (A) creates the index using the ‘COMPACT’ index handler on the Origin column. Hive also offers a bitmap index handler as of the 0.8 release, which is intended for creating indexes on columns with a few unique values.

In Step (A) the keywords WITH DEFERRED REBUILD instructs Hive to first create an empty index; Step (B) is where you actually build the index with the ALTER INDEX … REBUILD command. Deferred index builds can be very useful in workflows where one process creates the tables and indexes, another loads the data and builds the indexes and a final process performs data analysis.

Hive doesn’t provide automatic index maintenance, so you need to rebuild the index if you overwrite or append data to the table. Also, Hive indexes support table partitions, so a rebuild can be limited to a partition. Step (C) illustrates how you can list or show the indexes created against a particular table.

Step (D) illustrates an important point regarding Hive indexes: Hive indexes are implemented as tables. This is why you need to first create the index table and then build it to populate the table. Therefore, you can use indexes in at least two ways:

  • Count on the system to automatically use indexes that you create.

  • Rewrite some queries to leverage the new index table.

The automatic use of indexes is progressing, but this aspect is a work in progress. Focusing on the second option, in Step (E) you write a query that seeks to determine how many flights left the Syracuse airport during 2008. To get this information, you leverage the COUNT aggregate function.

You can see that Hive took 17.32 seconds on the virtual machine to report that 12,032 flights originated from Syracuse, New York.

In Step (F), you leverage the new index table and use the SIZE function instead. Step (F) makes more sense after you study Step (D): Step (D) shows you what an index table looks like, where records each hold the column _bucketname, which is the location of the data in the Hive warehouse (/home/biadmin/Hive/warehouse, in this case), and an _offsets array, which is the index into the table (FlightInfo2008) in this case.

So now the query in Step (F) makes sense. All Hive has to do is find the SYR origin (for Syracuse) in the flightdata__flightinfo2008_f08index__ table and then count the rows in the _offsets array to get the number of flights — a simple yet elegant way to double the performance (8.347 secs in Step (F) versus 17.34 in Step (E)) of the original query.