Sqoop Exports Using the Update and Update Insert Approach

By Dirk deRoos

With insert mode, records exported by Sqoop are appended to the end of the target table. Sqoop also provides an update mode that you can use by providing the –update-key <column(s)> command line argument. This action causes Sqoop to generate a SQL UPDATE statement to run on the RDBMS or data warehouse.

Assume that you want to update a three-column table with data stored in the HDFS file /user/my-hdfs-file. The file contains this data:

100, 1000, 2000

The following abbreviated Sqoop export command generates the corresponding SQL UPDATE statement on your database system:

$ sqoop export (Generic Arguments)
  --table target-relational-table 
  --update-key column1
  --export-dir /user/my-hdfs-file
Generates => UPDATE target-relational-table SET
             WHERE column1=100;

With the preceding export command, if the target-relational-table on your RDBMS or data warehouse system has no record with the matching value in column1, nothing is changed in target-relational-table.

However, you may also include another argument that inserts or appends your data to target-table if no matching records are found. Think of it this way: If exists UPDATE else INSERT.

This technique is often referred to as upsert in the database vernacular or as MERGE in other implementations. The argument for upsert mode is –update-mode <mode>, where updateonly is the default and allowinsert activates upsert mode. Check your database documentation or consult with your vender to determine whether upsert mode is supported with Apache Sqoop.