Sqoop Merge

New Contributor

Hi, I am trying to understand the merge option in Sqoop, but I am unable to fully grasp it. My basic understanding of the merge option is that it can update records based on the id key.

create table customer (id int, name varchar(100), recorddt date);

insert into customer values (1,'abc','2016-01-01');
insert into customer values (2,'def','2016-01-02');

After this, I extracted the records using the incremental append option. Both records are extracted.
sqoop import \
--connect jdbc:mysql://localhost/training_db  \
--username root \
--password cloudera \
--table customer \
--target-dir '/user/cloudera/kbk/sqoop1' \
--incremental append \
--check-column id \
--last-value 0 \
-m 1

Output of the above script
[cloudera@quickstart sqoop_scripts]$ hadoop fs -ls /user/cloudera/kbk/sqoop1
Found 1 items
-rw-r--r--  1 cloudera cloudera  34 2017-01-01 05:49 /user/cloudera/kbk/sqoop1/part-m-00000
[cloudera@quickstart sqoop_scripts]$ hadoop fs -cat /user/cloudera/kbk/sqoop1/part-m-00000
1,abc,2016-01-01
2,def,2016-01-02
[cloudera@quickstart sqoop_scripts]$ 
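
Side note: Sqoop itself suggests saving such imports as a saved job (note the "Consider saving this with 'sqoop job --create'" line in the log further below), so that --last-value is tracked automatically between runs. A minimal sketch, where the job name customer_inc is my own choice:

sqoop job --create customer_inc -- import \
--connect jdbc:mysql://localhost/training_db \
--username root \
--password cloudera \
--table customer \
--target-dir '/user/cloudera/kbk/sqoop1' \
--incremental append \
--check-column id \
--last-value 0 \
-m 1

Subsequent runs (sqoop job --exec customer_inc) reuse and update the stored last-value.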

Update one record

update customer set name = 'abcxyz', recorddt = '2016-01-04' where id = 1;

mysql> select * from customer;
+------+--------+------------+
| id   | name   | recorddt   |
+------+--------+------------+
|    1 | abcxyz | 2016-01-04 |
|    2 | def    | 2016-01-02 |
+------+--------+------------+

After this, I extracted the updated record using the incremental lastmodified option with the script below.

sqoop import \
--connect jdbc:mysql://localhost/training_db \
--username root \
--password cloudera \
--table customer \
--target-dir '/user/cloudera/kbk/sqoop2' \
--incremental lastmodified \
--check-column recorddt \
--last-value '2016-01-03' \
--merge-key id \
-m 1

In the above script, if I use the same target-dir as the previous script, Sqoop throws an error telling that the location exists, then re-runs as a merge and generates a part-r-00000 file with all the records. So I changed the location to a new path and executed the same.
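
In other words, with --incremental lastmodified and --merge-key, the merge can happen as part of the import itself: pointing the same import at the existing directory makes Sqoop run a merge job and write a combined part-r-00000 file. A sketch of that variant, folding the update back into sqoop1:

sqoop import \
--connect jdbc:mysql://localhost/training_db \
--username root \
--password cloudera \
--table customer \
--target-dir '/user/cloudera/kbk/sqoop1' \
--incremental lastmodified \
--check-column recorddt \
--last-value '2016-01-03' \
--merge-key id \
-m 1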

17/01/01 05:54:48 INFO mapreduce.ImportJobBase: Transferred 20 bytes in 31.8766 seconds (0.6274 bytes/sec)
17/01/01 05:54:48 INFO mapreduce.ImportJobBase: Retrieved 1 records.
17/01/01 05:54:48 INFO tool.ImportTool: Final destination exists, will run merge job.
17/01/01 05:54:48 INFO tool.ImportTool: Moving data from temporary directory _sqoop/be32a583f44e47ebbde872e58513d40f_customer to final destination /user/cloudera/kbk/sqoop2
17/01/01 05:54:48 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
17/01/01 05:54:48 INFO tool.ImportTool:  --incremental lastmodified
17/01/01 05:54:48 INFO tool.ImportTool:  --check-column recorddt
17/01/01 05:54:48 INFO tool.ImportTool:  --last-value 2017-01-01 05:54:16.0
17/01/01 05:54:48 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

Output of the above script

[cloudera@quickstart sqoop_scripts]$ hadoop fs -ls /user/cloudera/kbk/sqoop2
Found 2 items
-rw-r--r--  1 cloudera cloudera  0 2017-01-01 05:54 /user/cloudera/kbk/sqoop2/_SUCCESS
-rw-r--r--  1 cloudera cloudera  20 2017-01-01 05:54 /user/cloudera/kbk/sqoop2/part-m-00000
[cloudera@quickstart sqoop_scripts]$ hadoop fs -cat /user/cloudera/kbk/sqoop2/part-m-00000
1,abcxyz,2016-01-04

I am able to understand the incremental option, but I am not able to clearly catch how this merge works. I am clearly missing something. What am I missing about how the merge works (the record with id 1 is the one that gets updated)?

8 Replies

Re: Sqoop Merge

Rising Star

Hi @Bharath Kumar K, with Sqoop merge you merge two datasets. You specify a --merge-key (usually the primary key), and Sqoop runs a MapReduce job that keeps the most recently updated row for each value of the merge key.

Re: Sqoop Merge

New Contributor

@Ajay Hey Ajay, I have two questions:

1. Without the merge key option, what happens?

2. If I use the previous target location, will it auto-merge even though it errors saying the directory exists?

Re: Sqoop Merge

New Contributor

@Rahul Soni, can you please take a look?

Re: Sqoop Merge

@Bharath Kumar K

You are using the merge functionality incorrectly! Your second sqoop job is doing an incremental import, not a merge!

Now since you have two imports, let's talk about merge functionality.

Your older dataset - /user/cloudera/kbk/sqoop1

Your newer dataset - /user/cloudera/kbk/sqoop2

What should the Sqoop job to merge them look like?

sqoop merge \
--new-data /user/cloudera/kbk/sqoop2 \
--onto /user/cloudera/kbk/sqoop1 \
--target-dir /user/cloudera/kbk/sqoop_merged \
--jar-file datatypes.jar \
--class-name Foo \
--merge-key id

Here, --jar-file is the jar that Sqoop generated when you did the import, and --class-name is the record class inside that jar. If the jar is no longer available, you can recreate both using the codegen tool.
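
A minimal codegen sketch (the output paths here are my own choice):

sqoop codegen \
--connect jdbc:mysql://localhost/training_db \
--username root \
--password cloudera \
--table customer \
--bindir /tmp/sqoop-codegen \
--outdir /tmp/sqoop-codegen

This should produce /tmp/sqoop-codegen/customer.jar, which you can then pass to sqoop merge as --jar-file /tmp/sqoop-codegen/customer.jar --class-name customer.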

When you run the aforementioned Sqoop job, the data from your older and newer datasets will be merged together based on your id column.
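
With your data, the merged output should look something like this (illustrative; the exact part file name can differ):

[cloudera@quickstart sqoop_scripts]$ hadoop fs -cat /user/cloudera/kbk/sqoop_merged/part-r-00000
1,abcxyz,2016-01-04
2,def,2016-01-02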

Refer to the Sqoop documentation for further details and let me know if you need any other help!

Cheers!

Re: Sqoop Merge

@Bharath Kumar K

Did the answer help in the resolution of your query? Please close the thread by marking the answer as Accepted!

Re: Sqoop Merge

New Contributor

I am not able to understand the use of sqoop merge. All I can understand is that it compares the old dataset with the new dataset and then replaces the old files with the new ones. My question is: if it is just about updating the old data with the new, why can't we import the full table again? Why use sqoop merge?

Re: Sqoop Merge

New Contributor

Hi,

The sqoop job runs OK, however I am facing an issue at the end:

Logging initialized using configuration in jar:file:/usr/hdp/2.6.4.0-91/hive/lib/hive-common-1.2.1000.2.6.4.0-91.jar!/hive-log4j.properties
OK
Time taken: 2.92 seconds
FAILED: SemanticException Line 2:17 Invalid path ''hdfs://hostname/user/xyz/_sqoop/55cc1038f2924cc398e5e014061eb0f2_sample_table'': No files matching path hdfs://hostname/user/xyz/_sqoop/55cc1038f2924cc398e5e014061eb0f2_sample_table

where xyz is the unix user who is running the sqoop operation. The target dir is different, and I can see data getting loaded into the designated target-dir. Has anyone encountered this?

