Importing Data into HBase with Sqoop

By Dirk deRoos

Sqoop can be used to transform a relational database schema into an HBase schema. Of course, the main goal here is to demonstrate how Sqoop can import data from an RDBMS or data warehouse directly into HBase, but it’s always better to see how a tool is used in context versus how it’s used in the abstract.

The figure shows how the Service Order Database might look after being transformed into an HBase schema.

image0.jpg

For this particular import example, you want to import the customercontactinfo table directly into an HBase table in preparation for building the HBase Service Order Database schema. To complete the HBase schema, you’d have to execute the same steps to import the productinfo table, and then the serviceorders table could be built with a Java MapReduce application.

Sqoop doesn’t now permit you to import, all at once, a relational table directly into an HBase table having multiple column families. To work around this limitation, you create the HBase table first and then execute three Sqoop import operations to finish the task. The listing shows the task of creating the table.

hbase(main):017:0> create 'customercontactinfo', 'CustomerName', 
hbase(main):018:0*        'ContactInfo', 'ProductNums'
0 row(s) in 1.0680 seconds

In the following listing, for each Sqoop import command, note that the target HBase column family specified by the –column-family CLA and the corresponding MySQL columns specified by the –columns CLA are in bold. The customernum primary key also becomes the HBase row key, as specified by the –hbase-row-key CLA.

$ sqoop import 
    --connect jdbc:mysql://localhost/serviceorderdb 
    --username root -P 
    --table customercontactinfo 
    --columns "customernum,customername" 
    --hbase-table customercontactinfo 
    --column-family CustomerName 
    --hbase-row-key customernum -m 1
Enter password:
...
13/08/17 16:53:01 INFO mapreduce.ImportJobBase: Retrieved 5 records.
$ sqoop import 
    --connect jdbc:mysql://localhost/serviceorderdb 
    --username root -P 
    --table customercontactinfo 
    --columns "customernum,contactinfo" 
    --hbase-table customercontactinfo 
    --column-family ContactInfo 
    --hbase-row-key customernum -m 1
Enter password:
...
13/08/17 17:00:59 INFO mapreduce.ImportJobBase: Retrieved 5 records.
$ sqoop import 
    --connect jdbc:mysql://localhost/serviceorderdb 
    --username root -P 
    --table customercontactinfo 
    --columns "customernum,productnums" 
    --hbase-table customercontactinfo 
    --column-family ProductNums 
    --hbase-row-key customernum -m 1
Enter password:
...
13/08/17 17:05:54 INFO mapreduce.ImportJobBase: Retrieved 5 records.

If you were to carry out an HBase scan of your new table, you’d see that the import and translation from a relational database table on MySQL directly into HBase was a success.

The customercontactinfo table in this example is rather small, but imagine the power you now have, using Sqoop and HBase, to quickly move relational tables that may be exceeding capacity on your RDBMS or data warehouse into HBase, where capacity is virtually unlimited and scalability is automatic.

hbase(main):033:0> scan 'customercontactinfo'
ROW    COLUMN+CELL
 10000    column=ContactInfo:contactinfo, timestamp=1376773256317, value=1 Hadoop Lane, NY, 11111, John.Smith@xyz.com
 10000    column=CustomerName:customername, timestamp=1376772776684, value=John Timothy Smith
 10000    column=ProductNums:productnums, timestamp=1376773551221, value=B500
 10001    column=ContactInfo:contactinfo, timestamp=1376773256317, value=2 HBase Ave, CA, 22222
 10001    column=CustomerName:customername, timestamp=1376772776684, value=Bill Jones
 10001    column=ProductNums:productnums, timestamp=1376773551221, value=A100,A200,A300,B400,B500,C500,C600,D700
 20000    column=ContactInfo:contactinfo, timestamp=1376773256317, value=1 Expert HBase Ave, CA, 22222
 20000    column=CustomerName:customername, timestamp=1376772776684, value=Jane Ann Doe
 20000    column=ProductNums:productnums, timestamp=1376773551221, value=A100,A200,A300
 20001    column=ContactInfo:contactinfo, timestamp=1376773256317, value=1 Piglatin Ave, CO, 33333
 20001    column=CustomerName:customername, timestamp=1376772776684, value=Joe Developer
 20001    column=ProductNums:productnums, timestamp=1376773551221, value=D700
 30000    column=ContactInfo:contactinfo, timestamp=1376773256317, value=1 Statistics Lane, MA, 33333
 30000    column=CustomerName:customername, timestamp=1376772776684, value=Data Scientist
 30000    column=ProductNums:productnums, timestamp=1376773551221, value=C500
5 row(s) in 0.1120 seconds

Importing existing relational data via Sqoop into Hive and HBase tables can potentially enable a wide range of new and exciting data analysis workflows. If this feature is of interest to you, check out the Apache Sqoop documentation for additional Hive and HBase command line arguments and features.