Defining Table Record Formats in Hive

By Dirk deRoos

The Java technology that Hive uses to process records and map them to column data types in Hive tables is called SerDe, which is short for SerializerDeserializer. The figure illustrates how SerDes are leveraged and it will help you understand how Hive keeps file formats separate from record formats.

image0.jpg

So the first thing to notice is the INPUTFORMAT object. INPUTFORMAT allows you to specify your own Java class should you want Hive to read from a different file format. STORED AS TEXTFILE is easier than writing INPUTFORMAT org.apache.hadoop.mapred.TextInputFormat — the whole Java package tree and class name for the default text file input format object, in other words.

The same is true of the OUTPUTFORMAT object. Instead of writing out the whole Java package tree and class name, the STORED AS TEXTFILE statement takes care of all of that for you.

Hive allows you to separate your record format from your file format, so how exactly do you accomplish this? Simple, you either replace STORED AS TEXTFILE with something like STORED AS RCFILE, or you can create your own Java class and specify the input and output classes using INPUTFORMAT packagepath.classname and OUTPUTFORMAT packagepath.classname.

Finally notice that when Hive is reading data from the HDFS (or local file system), a Java Deserializer formats the data into a record that maps to table column data types. This would characterize the data flow for a HiveQL SELECT statement. When Hive is writing data, a Java Serializer accepts the record Hive uses and translates it such that the OUTPUTFORMAT class can write it to the HDFS (or local file system).

This would characterize the data flow for a HiveQL CREATE-TABLE-AS-SELECT statement. So the INPUTFORMAT, OUTPUTFORMAT and SerDe objects allow Hive to separate the table record format from the table file format.

Hive bundles a number of SerDes for you to choose from, and you’ll find a larger number available from third parties if you search online. You can also develop your own SerDes if you have a more unusual data type that you want to manage with a Hive table. (Possible examples here are video data and e-mail data.)

In the following list, some of the SerDes provided with Hive are described as well as one third-party option that you may find useful.

  • LazySimpleSerDe: The default SerDe that’s used with the TEXTFILE format; it would be used with our_first_table from the following listing.

    (A) $ $HIVE_HOME/bin hive --service cli
    (B) hive> set hive.cli.print.current.db=true;
    (C) hive (default)> CREATE DATABASE ourfirstdatabase;
    OK
    Time taken: 3.756 seconds
    (D) hive (default)> USE ourfirstdatabase;
    OK
    Time taken: 0.039 seconds
    (E) hive (ourfirstdatabase)> CREATE TABLE our_first_table (
                           > FirstName       STRING,
                           > LastName        STRING,
                           > EmployeeId      INT);
    OK
    Time taken: 0.043 seconds
    hive (ourfirstdatabase)> quit;
    (F) $ ls /home/biadmin/Hive/warehouse/ourfirstdatabase.db
    our_first_table

    It also would be used with data_types_table from the following listing.

    $ ./hive --service cli
    hive> CREATE DATABASE data_types_db;
    OK
    Time taken: 0.119 seconds
    hive> USE data_types_db;
    OK
    Time taken: 0.018 seconds
    (1)Hive> CREATE TABLE data_types_table (
    (2) > our_tinyint     TINYINT   COMMENT '1 byte signed integer',
    (3) > our_smallint    SMALLINT  COMMENT '2 byte signed integer',
    (4) > our_int         INT       COMMENT '4 byte signed integer',
    (5) > our_bigint      BIGINT    COMMENT '8 byte signed integer',
    (6) > our_float       FLOAT     COMMENT 'Single precision floating point',
    (7) > our_double      DOUBLE    COMMENT 'Double precision floating point',
    (8) > our_decimal     DECIMAL   COMMENT 'Precise decimal type based
    (9) >                                      on Java BigDecimal Object',
    (10) > our_timestamp   TIMESTAMP COMMENT 'YYYY-MM-DD HH:MM:SS.fffffffff"
    (11) >                                     (9 decimal place precision)',
    (12) > our_boolean     BOOLEAN   COMMENT 'TRUE or FALSE boolean data type',
    (13) > our_string      STRING    COMMENT 'Character String data type',
    (14) > our_binary      BINARY    COMMENT 'Data Type for Storing arbitrary
    (15) >                                      number of bytes',
    (16) > our_array       ARRAY<TINYINT>  COMMENT 'A collection of fields all of
    (17) >                                 the same data type indexed BY
    (18) >                                 an integer',
    (19) > our_map       MAP<STRING,INT> COMMENT 'A Collection of Key,Value Pairs
    (20) >                                   where the Key is a Primitive
    (21) >                                   Type and the Value can be
    (22) >                                   anything.  The chosen data
    (23) >                                   types for the keys and values
    (24) >                                   must remain the same per map',
    (25) > our_struct    STRUCT<first : SMALLINT, second : FLOAT, third : STRING>
    (26) >                               COMMENT 'A nested complex data
    (27) >                                    structure',
    (28) > our_union      UNIONTYPE<INT,FLOAT,STRING>
    (29) >                             COMMENT 'A Complex Data Type that can
    (30) >                                    hold One of its Possible Data
    (31) >                                    Types at Once')
    (32) > COMMENT 'Table illustrating all Apache Hive data types'
    (33) > ROW FORMAT DELIMITED
    (34) > FIELDS TERMINATED BY ','
    (35) > COLLECTION ITEMS TERMINATED BY '|'
    (36) > MAP KEYS TERMINATED BY '^'
    (37) > LINES TERMINATED BY 'n'
    (38) > STORED AS TEXTFILE
    (39) > TBLPROPERTIES ('creator'='Bruce Brown', 'created_at'='Sat Sep 21 20:46:32 EDT 2013');
    OK
    Time taken: 0.886 seconds
  • ColumnarSerDe: Used with the RCFILE format.

  • RegexSerDe: The regular expression SerDe, which ships with Hive to enable the parsing of text files, RegexSerDe can form a powerful approach for building structured data in Hive tables from unstructured blogs, semi-structured log files, e-mails, tweets, and other data from social media. Regular expressions allow you to extract meaningful information (an e-mail address, for example) with HiveQL from an unstructured or semi-structured text document incompatible with traditional SQL and RDBMSs.

  • HBaseSerDe: Included with Hive to enables it to integrate with HBase. You can store Hive tables in HBase by leveraging this SerDe.

  • JSONSerDe: A third-party SerDe for reading and writing JSON data records with Hive. You can quickly find (via Google and GitHub) two JSON SerDes by searching online for the phrase json serde for hive.

  • AvroSerDe: Included with Hive so that you can read and write Avro data in Hive tables.

Reviewing the Language Manual DDL can be very helpful before you start creating your tables.