By Dirk deRoos

The concept of windowing, introduced in the SQL:2003 standard, allows the SQL programmer to create a frame from the data against which aggregate and other window functions can operate. HiveQL now supports windowing per the SQL standard. Examples are quite helpful when explaining windowing and aggregate functions.

Departure delays come with the territory when flying is your chosen mode of travel. It comes as no surprise, then, that the RITA-compiled flight data includes this information. “What exactly is the average flight delay per day”? The query in the following listing produces the average departure delay per day in 2008.

(A) hive (flightdata)> CREATE VIEW avgdepdelay AS
           > SELECT DayOfWeek, AVG(DepDelay) 
           FROM FlightInfo2008 GROUP BY DayOfWeek;
OK
Time taken: 0.121 seconds
(B) hive (flightdata)> SELECT * FROM avgdepdelay;
...
OK
1       10.269990244459473
2       8.97689712068735
3       8.289761053658728
4       9.772897177836702
5       12.158036387869656
6       8.645680904903614
7       11.568973392595312
Time taken: 18.6 seconds, Fetched: 7 row(s)

TGIF, or “Thank God It’s Friday,” doesn’t apply to everyone. It should be no surprise that Friday — Day 5 under the results in Step (B) — had the highest number of delays.

Anyway, about that query in Step (A): Hive’s Data Definition Language (DDL) also includes the CREATE VIEW statement, which can be quite useful. In Hive, views allow a query to be saved but data is not stored as with the Create Table as Select (CTAS) statement.

When a view is referenced in HiveQL, Hive executes the query and then uses the results, which could be part of a larger query. This can be very useful to simplify complex queries and break them down into logical components. Additionally, note the GROUP BY clause, which gathers all the days per week and allows the AVG aggregate function to provide a consolidated answer per day.

This information is useful, of course, but what if you want to see some individual numbers per day? Consolidate the data with GROUP BY, and you have the answer you’re looking for, though you’ve lost information as well. Solving this problem of information loss is where windowing becomes quite handy.

Here’s another question about the RITA 2008 flight data that Apache Hive can answer: “What is the first flight between Airport X and Y”? Suppose that in addition to this information, you want to know about subsequent flights, just in case you’re not a “morning person.” Well, this is a job for windowing in HiveQL! The following listing provides you with a query that answers these questions.

(A) hive (flightdata)> SELECT f08.Month, f08.DayOfMonth,
    cr.description, f08.Origin, f08.Dest, f08.FlightNum,
    f08.DepTime, MIN(f08.DepTime)
OVER (PARTITION BY f08.DayOfMonth ORDER BY f08.DepTime)
FROM flightinfo2008 f08 JOIN Carriers cr ON 
    f08.UniqueCarrier = cr.code
WHERE f08.Origin = 'JFK' AND f08.Dest = 'ORD' AND 
    f08.Month = 1 AND f08.DepTime != 0;
...
OK
1     1  JetBlue Airways         JFK ORD 903   641 641
1     1  American Airlines Inc.  JFK ORD 1323  833  641
1     1  JetBlue Airways         JFK ORD 907   929  641
1     1  Comair Inc.             JFK ORD 5083  945  641
1     1  Comair Inc.             JFK ORD 5634  1215 641
1     1  JetBlue Airways         JFK ORD 915   1352 641
1     1  American Airlines Inc.  JFK ORD 1323  833  641
1     1  JetBlue Airways         JFK ORD 907   929  641
1     1  Comair Inc.             JFK ORD 5083  945  641
1     1  Comair Inc.             JFK ORD 5634  1215 641
1     1  JetBlue Airways         JFK ORD 915   1352 641
1     1  American Airlines Inc.  JFK ORD 1815  1610 641
1     1  JetBlue Airways         JFK ORD 917   1735 641
1     1  Comair Inc.             JFK ORD 5469  1749 641
1     1  Comair Inc.             JFK ORD 5492  2000 641
1     1  JetBlue Airways         JFK ORD 919   2102 641
1     31 JetBlue Airways         JFK ORD 919   48   48
1     31 JetBlue Airways         JFK ORD 903   635  48
1     31 Comair Inc.             JFK ORD 5447  650  48
1     31 American Airlines Inc.  JFK ORD 1323  840  48
1     31 JetBlue Airways         JFK ORD 907   921  48
1     31 JetBlue Airways         JFK ORD 917   1859 48

In Step (A), the GROUP BY clause was replaced with the OVER clause where you specify the PARTITION or window over which you want the MIN aggregate function to operate. Also included is the ORDER BY clause so that you can see those subsequent flights after the first one.

As you can see from the listing, on January 31, JetBlue has a nice, early flight at 12:48 a.m. —opt for a later one, at 6:35 a.m. Early-riser issues aside, note that you have retained the information in the query output that would have been lost if you had chosen to use a GROUP BY clause again.

This capability alone makes windowing a powerful feature, and there’s more. Along with windowing in the Hive 0.11 release, the community provided some analytics functions that you can use in conjunction with windowing. Also at your disposal are these functions: RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, and NTILE.

Finally, don’t miss the use of JOIN: It’s a real-life, practical example of an inner join in which you join the FlightInfo2008 table with the Carriers table to get the airline name — rather than the cryptic code found in the FlightInfo2008 table.