
Importing Data into HDFS with Sqoop

By Dirk deRoos

Imagine a relational database used by a fictional service company that has been taking (you guessed it) Apache Hadoop service calls and now wants to move some of its data onto Hadoop to run Hive queries, leverage HBase scalability and performance, and run text analytics on its customers' problem descriptions.

[Figure: The Service Order Database schema]

Sqoop is the tool you'll want to use to import data from relational tables into HDFS, Hive, and HBase on Hadoop.

The following listing shows the MySQL commands used to build the Service Order Database you see in the figure. The database runs on an installed MySQL RDBMS that you can import from and export to using Sqoop.

/* Create the Service Orders Database */
CREATE DATABASE serviceorderdb;
USE serviceorderdb;
/* Create the Product Information Table */
CREATE TABLE productinfo(
    productnum CHAR(4) PRIMARY KEY,
    productdesc VARCHAR(100)
);
/* Create the Customer Contact Information Table */
CREATE TABLE customercontactinfo(
    customernum INT PRIMARY KEY,
    customername VARCHAR(100),
    contactinfo VARCHAR(100),
    productnums SET('A100','A200','A300','B400','B500','C500','C600','D700')
);
/* Create the Service Orders Table */
CREATE TABLE serviceorders(
    serviceordernum INT PRIMARY KEY,
    customernum INT,
    productnum CHAR(4),
    status VARCHAR(100),
    FOREIGN KEY (customernum) REFERENCES customercontactinfo(customernum),
    FOREIGN KEY (productnum) REFERENCES productinfo(productnum)
);
/* Insert product data into the Product Information Table */
INSERT INTO productinfo VALUES ('A100', 'HBase Support Product');
INSERT INTO productinfo VALUES ('A200', 'Hive Support Product');
INSERT INTO productinfo VALUES ('A300', 'Sqoop Support Product');
INSERT INTO productinfo VALUES ('B400', 'Ambari Support Product');
INSERT INTO productinfo VALUES ('B500', 'HDFS Support Product');
INSERT INTO productinfo VALUES ('C500', 'Mahout Support Product');
INSERT INTO productinfo VALUES ('C600', 'Zookeeper Support Product');
INSERT INTO productinfo VALUES ('D700', 'Pig Support Product');
/* Insert customer data into the Customer Contact Information Table */
INSERT INTO customercontactinfo 
VALUES (10000, 'John Timothy Smith', '1 Hadoop Lane, NY, 11111, 
        John.Smith@xyz.com', 'B500');
INSERT INTO customercontactinfo 
VALUES (10001, 'Bill Jones', '2 HBase Ave, CA, 22222', 
        'A100,A200,A300,B400,B500,C500,C600,D700');
INSERT INTO customercontactinfo 
VALUES (20000, 'Jane Ann Doe', '1 Expert HBase Ave, CA, 22222', 
        'A100,A200,A300');
INSERT INTO customercontactinfo 
VALUES (20001, 'Joe Developer', '1 Piglatin Ave, CO, 33333', 'D700');
INSERT INTO customercontactinfo 
VALUES (30000, 'Data Scientist', '1 Statistics Lane, MA, 33333', 'A300,C500');
/* Enter service orders into the Service Orders Table */
INSERT INTO serviceorders 
VALUES (100000, 20000, 'A200', 'I have some questions on building HiveQL queries? My Hadoop for Dummies book has not arrived yet!');
INSERT INTO serviceorders 
VALUES (100001, 10001, 'A100', 'I need to understand how to configure Zookeeper for my HBase Cluster?');
INSERT INTO serviceorders 
VALUES (200000, 20001, 'D700', 'I am writing some Piglatin and I have a few questions?');
INSERT INTO serviceorders 
VALUES (200001, 30000, 'A300', 'How do I merge my data sets after Sqoop incremental imports?');
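If you want to recreate the database yourself, one approach is to save the listing above to a script file and feed it to the mysql client. The sketch below assumes a local MySQL server with a root account, and the file name serviceorderdb.sql is a hypothetical choice of mine, not something from the listing; the mysql call is guarded so the sketch is a no-op on a machine without the client installed.

```shell
# Hypothetical file name: save the SQL listing above as serviceorderdb.sql.
SCRIPT=serviceorderdb.sql

# Run the script only if the mysql client is actually on the PATH.
# -u root -p prompts for the root password, as in the Sqoop examples.
if command -v mysql >/dev/null 2>&1; then
    mysql -u root -p < "$SCRIPT"
fi
```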

The following listing confirms that the MySQL Service Order Database has been created using the commands shown previously, and shows you the table names that you’ll import from using Sqoop.

mysql> show tables;
+--------------------------+
| Tables_in_serviceorderdb |
+--------------------------+
| customercontactinfo      |
| productinfo              |
| serviceorders            |
+--------------------------+
3 rows in set (0.00 sec)

Now that you have seen the MySQL Service Order Database records that are just waiting to be exploited, it’s time to turn your attention to Hadoop and run your first Sqoop command.

Don’t pull out the trusty import command right off the bat. Sqoop includes several handy tools along with import and export, including the list-databases command, which is used in the following listing. Using that command, you can confirm that you have connectivity and visibility into the MySQL database.

$ sqoop list-databases --connect jdbc:mysql://localhost/ \
                       --username root -P
Enter password:
13/08/15 17:21:00 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
mysql
performance_schema
serviceorderdb

The serviceorderdb database is shown to be available, so now you can list the tables within serviceorderdb by using the Sqoop list-tables command. Notice that this time you add the database that you want Sqoop to access to the jdbc:mysql URL:

$ sqoop list-tables \
         --connect jdbc:mysql://localhost/serviceorderdb \
         --username root -P
Enter password:
13/08/15 17:22:01 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
customercontactinfo
productinfo
serviceorders

Sqoop now has connectivity and can access the three tables from the figure. That means you can execute your first Sqoop import command and target the serviceorders table with a clean conscience. Sqoop import commands have this format:

sqoop import (generic arguments) (import arguments)

With the generic arguments, you point to your MySQL database and provide the necessary login information, just as you did with the preceding list-tables tool. With the import arguments, you specify what you want to import and how the import should be performed.
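To make that split concrete, here's a sketch that separates the two groups into shell variables; the connection details match the earlier listings, and the final sqoop call is guarded in case the sqoop client isn't on your PATH.

```shell
# Generic arguments: where the database is and how to log in.
GENERIC_ARGS="--connect jdbc:mysql://localhost/serviceorderdb --username root -P"

# Import arguments: what to import and how (one map task here).
IMPORT_ARGS="--table serviceorders -m 1"

# Show the full command line that would be run.
echo "sqoop import $GENERIC_ARGS $IMPORT_ARGS"

# Run it only if the sqoop client is actually installed.
if command -v sqoop >/dev/null 2>&1; then
    sqoop import $GENERIC_ARGS $IMPORT_ARGS
fi
```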

In the following listing, you specify the serviceorders table and request that one map task be used for the import by means of the -m 1 argument. (By default, Sqoop would use four map tasks, but that would be overkill for this small table and a single virtual machine.)

You have also specified the --class-name for the generated code and specified the --bindir where the compiled code and .jar file should be located. (Without these arguments, Sqoop would place the generated Java source file in your current working directory and the compiled .class file and .jar file in /tmp/sqoop-<username>/compile.)

The class name simply derives from the table name unless you specify a name with the help of the --class-name command-line argument (CLA). The --target-dir is the location in HDFS where you want the imported table to be placed.

$ sqoop import \
  --connect jdbc:mysql://localhost/serviceorderdb \
  --username root -P \
  --table serviceorders -m 1 \
  --class-name serviceorders \
  --target-dir /usr/biadmin/serviceorders-import \
  --bindir .
Enter password:
...
13/08/25 14:43:56 INFO mapreduce.ImportJobBase: Transferred 356 bytes in 21.0736 seconds (16.8932 bytes/sec)
13/08/25 14:43:56 INFO mapreduce.ImportJobBase: Retrieved 4 records.
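The imported rows land as delimited text files under the --target-dir you specified. As a quick check, you can list and print those files with the HDFS shell. This is a sketch that reuses the same directory as the import command above and is guarded so it does nothing on a machine without the hadoop client.

```shell
# Same HDFS directory used by the sqoop import command above.
TARGET_DIR=/usr/biadmin/serviceorders-import

if command -v hadoop >/dev/null 2>&1; then
    # List the files Sqoop wrote (one part file, since -m 1 was used).
    hadoop fs -ls "$TARGET_DIR"
    # Print the imported records; the 4 serviceorders rows should appear.
    hadoop fs -cat "$TARGET_DIR"/part-m-00000
fi
```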