The Principles of Sqoop Design

By Dirk deRoos

When it comes to Sqoop, a picture is often worth a thousand words, so check out the figure, which gives you a bird’s-eye view of the Sqoop architecture.

[Figure: A bird's-eye view of the Sqoop architecture]

The idea behind Sqoop is that it leverages map tasks — tasks that perform the parallel import and export of relational database tables — right from within the Hadoop MapReduce framework. This is good news because the MapReduce framework provides fault tolerance for import and export jobs along with parallel processing!
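To make the idea concrete, here's a sketch of a parallel table import. The connection string, credentials, table name, and directory are hypothetical stand-ins; the flags themselves are standard Sqoop options.

```shell
# Sketch: import a relational table in parallel (hypothetical connection details).
# Sqoop launches one map task per mapper; each map task imports a slice of the
# table, split on the table's primary key by default.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --username sqoop_user -P \
  --num-mappers 4 \
  --target-dir /user/sqoop/orders
```

With `--num-mappers 4`, the MapReduce framework runs four map tasks, each pulling a range of rows over JDBC, so the import is both parallel and fault tolerant: a failed task is retried without restarting the whole job.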

You’ll appreciate the fault tolerance if there is a failure during a large table import or export because the MapReduce framework will recover without requiring you to start the process all over again.

Sqoop can import data into Hive and HBase. Note, however, that the arrows to Hive and HBase point in only one direction. Data stored in any relational database with JDBC support can be imported directly into the Hive or HBase systems with Sqoop. Exports, however, are performed only from data stored in HDFS.
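Those one-way arrows correspond to Sqoop's Hive and HBase import options. The following sketches show both targets; the connection details, table names, column family, and row key are hypothetical stand-ins.

```shell
# Sketch: import straight into a Hive table (hypothetical connection details).
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --username sqoop_user -P \
  --hive-import \
  --hive-table sales.orders

# Sketch: import into an HBase table instead. Each row is stored under the
# hypothetical column family "cf", keyed on the order_id column.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --username sqoop_user -P \
  --hbase-table orders \
  --column-family cf \
  --hbase-row-key order_id
```

Notice there is no `--hive-export` or `--hbase-export` counterpart; that's the one-way arrow in action.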

Therefore, if you need to export your Hive tables, you point Sqoop at the HDFS directories that store your Hive tables' data. If you need to export HBase tables, you first have to extract them into HDFS and then run the Sqoop export command against that HDFS location.
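An export job, then, always reads from an HDFS path. The sketch below assumes a hypothetical Hive-managed table whose data lives under the default warehouse directory; your warehouse path and table names will differ.

```shell
# Sketch: export HDFS data back to a relational table (hypothetical details).
# For a Hive table, --export-dir points at the directory holding the table's
# data files in the Hive warehouse.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders_summary \
  --username sqoop_user -P \
  --export-dir /user/hive/warehouse/sales.db/orders_summary
```

The target relational table (`orders_summary` here) must already exist in the database; Sqoop's map tasks then write the HDFS records into it in parallel.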