Hive INSERT Command Examples - dummies

Hive INSERT Command Examples

By Dirk deRoos

One Hive DML command to explore is the INSERT command. You basically have three INSERT variants; two of them are shown in the following listing. To demonstrate this new DML command, you will create a new table that will hold a subset of the data in the FlightInfo2008 table.

(A) CREATE TABLE IF NOT EXISTS myFlightInfo (
  Year SMALLINT, DontQueryMonth TINYINT, DayofMonth TINYINT, DayOfWeek TINYINT,
  DepTime SMALLINT, ArrTime SMALLINT,
  UniqueCarrier STRING, FlightNum STRING,
  AirTime SMALLINT, ArrDelay SMALLINT, DepDelay SMALLINT,
  Origin STRING, Dest STRING, Cancelled SMALLINT,
  CancellationCode STRING)
COMMENT 'Flight InfoTable'
PARTITIONED BY(Month TINYINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY 'n'
(B) STORED AS RCFILE
TBLPROPERTIES ('creator'='Bruce Brown', 'created_at'='Mon Sep  2 14:24:19 EDT 2013');
(C) INSERT OVERWRITE TABLE myflightinfo
  PARTITION (Month=1)
  SELECT Year, Month, DayofMonth, DayOfWeek, DepTime, ArrTime, UniqueCarrier,
         FlightNum, AirTime, ArrDelay, DepDelay, Origin, Dest, Cancelled,
         CancellationCode
  FROM FlightInfo2008 WHERE Month=1;
(D) FROM FlightInfo2008
INSERT INTO TABLE myflightinfo
  PARTITION (Month=2)
  SELECT Year, Month, DayofMonth, DayOfWeek, DepTime, ArrTime, UniqueCarrier, FlightNum,
  AirTime, ArrDelay, DepDelay, Origin, Dest, Cancelled, CancellationCode WHERE Month=2
... (Months 3 through 11 skipped for brevity)
INSERT INTO TABLE myflightinfo
  PARTITION (Month=12)
  SELECT Year, Month, DayofMonth, DayOfWeek, DepTime, ArrTime, UniqueCarrier, FlightNum,
  AirTime, ArrDelay, DepDelay, Origin, Dest, Cancelled, CancellationCode WHERE Month=12;
(E) hive (flightdata)> SHOW PARTITIONS myflightinfo;
OK
month=1
month=10
month=11
month=12
...
month=9
(F) $ ls /home/biadmin/Hive/warehouse/flightdata.db/myflightinfo
month=1   month=11  month=2  month=4  month=6  month=8
month=10  month=12  month=3  month=5  month=7  month=9
(G) $HIVE_HOME/bin/hive --service rcfilecat
  /home/biadmin/Hive/warehouse/flightdata.db/myflightinfo/month=12/000000_0
...
2008    12      13      6       655     856     DL      1638    85      0       -5      PBI     ATL     0
2008    12      13      6       1251    1446    DL      1639    89      9       11      IAD     ATL     0
2008    12      13      6       1110    1413    DL      1641    104     -5      7       SAT     ATL     0

In Step (A), you create this new table and specify that the file format will be row columnar (Step (B)) instead of text. This format is more compact than text and often performs better, depending on your access patterns. (If you’re accessing a small subset of columns instead of entire rows, try the RCFILE format.)

The default SerDe for RCFILE format is the ColumnarSerDe. You can verify this fact by running the DESCRIBE EXTENDED myFlightInfo HiveQL command from the command line interface.

After creating the table, you use the INSERT OVERWRITE command [see Step (C)] to insert data via a SELECT statement from the FlightInfo2008 table. Note that you’re partitioning your data using the PARTITION keyword based on the Month field.

After you’re finished, you’ll have 12 table partitions, or actual directories, under the warehouse directory in the file system on your virtual machine, corresponding to the 12 months of the year. Partitioning can dramatically improve your query performance if you want to query data in the myFlightInfo table for only a certain month.

You can see the results of the PARTITION approach with the SHOW PARTITIONS command in Steps (E) and (F). Notice in Step (D) that you’re using a variant of the INSERT command to insert data into multiple partitions at one time. Only month 2 and 12 are shown for brevity but months 3 through 11 would have the same syntax.

Partitions are quite useful to the Hive programmer. However, it’s not uncommon to encounter a data set where partitioning could become unwieldy, especially if multiple partitions are specified [PARTITION BY(Country STRING, PersonName STRING), for example]. Twelve partitions are one thing — 7 billion partitions would be quite another!

The solution to partition sprawl is bucketing. Bucketing in Hive works by allowing you to specify some reasonable number of buckets, and then the system attempts to evenly distribute the data into the number of buckets you specify. [That could look something like PARTITION BY(…) CLUSTERED BY(BucketingColumn) INTO x BUCKETS.]

Additionally, this feature enables table sampling — a technique that allows Hive users to write queries on a sample of the data instead of the entire table. HiveQL table sampling can be very useful for big data analytics.

You can also use this FROM table1 INSERT INTO table2 SELECT… format to insert into multiple tables at a time. You use INSERT instead of OVERWRITE here to show the option of inserting instead of overwriting. Hive allows only appends, not inserts, into tables, so the INSERT keyword simply instructs Hive to append the data to the table.

Finally, note in Step (G) that you have to use a special Hive command service (rcfilecat) to view this table in your warehouse, because the RCFILE format is a binary format, unlike the previous TEXTFILE format examples.

The INSERT DML command has three variants. The third variant is the Dynamic Partition Inserts variant. In the listing, you partition the myFlightInfo table into 12 segments, 1 per month. If you had hundreds of partitions, this task would have become quite difficult, and it would have required scripting to get the job done.

Instead, Hive supports a technique for dynamically creating partitions with the INSERT OVERWRITE statement. So, if you find yourself needing to leverage table partitioning with a large, and possibly variable, number of partitions, check out the Dynamic Partition Inserts feature in the Hive DML Language Manual.