Windowing in HiveQL
The concept of windowing, introduced in the SQL:2003 standard, allows the SQL programmer to create a frame from the data against which aggregate and other window functions can operate. HiveQL now supports windowing per the SQL standard. Examples are quite helpful when explaining windowing and aggregate functions.
Departure delays come with the territory when flying is your chosen mode of travel. It comes as no surprise, then, that the RITA-compiled flight data includes this information. “What exactly is the average flight delay per day”? The query in the following listing produces the average departure delay per day in 2008.
(A) hive (flightdata)> CREATE VIEW avgdepdelay AS > SELECT DayOfWeek, AVG(DepDelay) FROM FlightInfo2008 GROUP BY DayOfWeek; OK Time taken: 0.121 seconds (B) hive (flightdata)> SELECT * FROM avgdepdelay; ... OK 1 10.269990244459473 2 8.97689712068735 3 8.289761053658728 4 9.772897177836702 5 12.158036387869656 6 8.645680904903614 7 11.568973392595312 Time taken: 18.6 seconds, Fetched: 7 row(s)
TGIF, or “Thank God It’s Friday,” doesn’t apply to everyone. It should be no surprise that Friday — Day 5 under the results in Step (B) — had the highest number of delays.
Anyway, about that query in Step (A): Hive’s Data Definition Language (DDL) also includes the CREATE VIEW statement, which can be quite useful. In Hive, views allow a query to be saved but data is not stored as with the Create Table as Select (CTAS) statement.
When a view is referenced in HiveQL, Hive executes the query and then uses the results, which could be part of a larger query. This can be very useful to simplify complex queries and break them down into logical components. Additionally, note the GROUP BY clause, which gathers all the days per week and allows the AVG aggregate function to provide a consolidated answer per day.
This information is useful, of course, but what if you want to see some individual numbers per day? Consolidate the data with GROUP BY, and you have the answer you’re looking for, though you’ve lost information as well. Solving this problem of information loss is where windowing becomes quite handy.
Here’s another question about the RITA 2008 flight data that Apache Hive can answer: “What is the first flight between Airport X and Y”? Suppose that in addition to this information, you want to know about subsequent flights, just in case you’re not a “morning person.” Well, this is a job for windowing in HiveQL! The following listing provides you with a query that answers these questions.
(A) hive (flightdata)> SELECT f08.Month, f08.DayOfMonth, cr.description, f08.Origin, f08.Dest, f08.FlightNum, f08.DepTime, MIN(f08.DepTime) OVER (PARTITION BY f08.DayOfMonth ORDER BY f08.DepTime) FROM flightinfo2008 f08 JOIN Carriers cr ON f08.UniqueCarrier = cr.code WHERE f08.Origin = 'JFK' AND f08.Dest = 'ORD' AND f08.Month = 1 AND f08.DepTime != 0; ... OK 1 1 JetBlue Airways JFK ORD 903 641 641 1 1 American Airlines Inc. JFK ORD 1323 833 641 1 1 JetBlue Airways JFK ORD 907 929 641 1 1 Comair Inc. JFK ORD 5083 945 641 1 1 Comair Inc. JFK ORD 5634 1215 641 1 1 JetBlue Airways JFK ORD 915 1352 641 1 1 American Airlines Inc. JFK ORD 1323 833 641 1 1 JetBlue Airways JFK ORD 907 929 641 1 1 Comair Inc. JFK ORD 5083 945 641 1 1 Comair Inc. JFK ORD 5634 1215 641 1 1 JetBlue Airways JFK ORD 915 1352 641 1 1 American Airlines Inc. JFK ORD 1815 1610 641 1 1 JetBlue Airways JFK ORD 917 1735 641 1 1 Comair Inc. JFK ORD 5469 1749 641 1 1 Comair Inc. JFK ORD 5492 2000 641 1 1 JetBlue Airways JFK ORD 919 2102 641 1 31 JetBlue Airways JFK ORD 919 48 48 1 31 JetBlue Airways JFK ORD 903 635 48 1 31 Comair Inc. JFK ORD 5447 650 48 1 31 American Airlines Inc. JFK ORD 1323 840 48 1 31 JetBlue Airways JFK ORD 907 921 48 1 31 JetBlue Airways JFK ORD 917 1859 48
In Step (A), the GROUP BY clause was replaced with the OVER clause where you specify the PARTITION or window over which you want the MIN aggregate function to operate. Also included is the ORDER BY clause so that you can see those subsequent flights after the first one.
As you can see from the listing, on January 31, JetBlue has a nice, early flight at 12:48 a.m. —opt for a later one, at 6:35 a.m. Early-riser issues aside, note that you have retained the information in the query output that would have been lost if you had chosen to use a GROUP BY clause again.
This capability alone makes windowing a powerful feature, and there’s more. Along with windowing in the Hive 0.11 release, the community provided some analytics functions that you can use in conjunction with windowing. Also at your disposal are these functions: RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, and NTILE.
Finally, don’t miss the use of JOIN: It’s a real-life, practical example of an inner join in which you join the FlightInfo2008 table with the Carriers table to get the airline name — rather than the cryptic code found in the FlightInfo2008 table.