Importing Data with Sqoop

By Dirk deRoos

Ready to dive into importing data with Sqoop? Start by taking a look at the figure, which illustrates the steps in a typical Sqoop import operation from an RDBMS or a data warehouse system. Nothing too complicated here — just a typical Products data table from a (typical) fictional company being imported into a typical Apache Hadoop cluster from a typical data management system (DMS).

[Figure: The steps in a typical Sqoop import operation from a data management system into a Hadoop cluster.]

During Step 1, Sqoop uses the appropriate connector to retrieve the Products table metadata from the DMS. (The metadata is used to map the data types in the Products table to data types in the Java language.)
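You kick off an import like this from the Sqoop command line. Here's a minimal sketch of Step 1 in action, assuming a fictional MySQL database named sales on a host called db.example.com; the JDBC URL, credentials, and paths are placeholders you'd swap for your own:

    # The JDBC connect string tells Sqoop which connector to use
    # when it retrieves the table metadata.
    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username sqoopuser \
        --password-file /user/sqoopuser/db.password \
        --table Products \
        --target-dir /user/sqoopuser/products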

Step 2 then uses this metadata to generate and compile a Java class that will be used by one or more map tasks to import the actual rows from the Products table. Sqoop saves the generated Java class to temp space or to a directory you specify so that you can leverage it for the subsequent processing of your data records.
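If you'd rather generate (and compile) that Java class without running an import at all, or control exactly where it lands, Sqoop's codegen tool handles it. A sketch with the same placeholder connection details; the class name and directories are made up for illustration:

    # Generate and compile the Products class only; no rows are moved.
    sqoop codegen \
        --connect jdbc:mysql://db.example.com/sales \
        --username sqoopuser \
        --password-file /user/sqoopuser/db.password \
        --table Products \
        --class-name Products \
        --outdir /home/sqoopuser/generated-src \
        --bindir /home/sqoopuser/generated-classes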

The Sqoop-generated Java code that is saved for you is like the gift that keeps on giving! With this code, Sqoop imports records from the DMS and stores them in HDFS using one of three formats that you can pick: binary Avro data files, binary sequence files, or delimited text files. Afterward, the code remains available to you for subsequent data processing.
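Picking a format comes down to a single flag on the import command. A sketch of the three choices, with the connection arguments trimmed to the placeholders from earlier:

    # Delimited text is the default; the flag just makes it explicit.
    sqoop import --connect jdbc:mysql://db.example.com/sales \
        --table Products --as-textfile

    # Binary sequence files:
    sqoop import --connect jdbc:mysql://db.example.com/sales \
        --table Products --as-sequencefile

    # Binary Avro data files:
    sqoop import --connect jdbc:mysql://db.example.com/sales \
        --table Products --as-avrodatafile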

Sequence files are a natural choice if you're importing binary data types and you'll need the generated Java class to serialize and deserialize your data later on, perhaps for MapReduce processing or for exporting it back to the DMS. Avro data, which is based on the Apache Avro serialization framework, is useful if you need other applications to work with your imported data after it lands in HDFS.

If you choose to store your imported data in delimited text format, you may find the generated Java code valuable later on as you parse your new data and perform data format conversions on it. The generated code also helps you merge data sets after subsequent Sqoop import operations (as shown in the sketch below), and it can help avoid ambiguity when processing delimited text data.
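Sqoop's merge tool is one place where that generated class earns its keep: it reconciles a newer incremental import with an older one. A sketch in which the jar, class name, directories, and merge key are hypothetical stand-ins:

    # Fold newly imported Products rows into an earlier import,
    # matching old and new records on the primary-key column.
    sqoop merge \
        --class-name Products \
        --jar-file /home/sqoopuser/generated-classes/Products.jar \
        --new-data /user/sqoopuser/products-increment \
        --onto /user/sqoopuser/products \
        --target-dir /user/sqoopuser/products-merged \
        --merge-key id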

Finally, during Step 3, Sqoop divides the data records in the Products table across a number of map tasks (with the number of mappers optionally specified by the user) and imports the table data into HDFS, Hive, or HBase.
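At the command line, the parallelism and the destination are just a few more flags on the same import. A sketch that assumes Products has a numeric primary key named id to split on; the Hive table name is a placeholder. (If you don't say otherwise, Sqoop defaults to four map tasks.)

    # Import with eight parallel map tasks, splitting the table on its
    # primary key, and land the rows in a Hive table rather than plain
    # HDFS files.
    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username sqoopuser \
        --password-file /user/sqoopuser/db.password \
        --table Products \
        --split-by id \
        --num-mappers 8 \
        --hive-import \
        --hive-table products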