Hi, I am trying to understand the merge option in Sqoop, but I'm unable to fully grasp it. My basic understanding is that the merge option can update records based on the id key.
create table customer (id int, name varchar(100), recorddt date);
insert into customer values (1,'abc','2016-01-01');
insert into customer values (2,'def','2016-01-02');

After this, I extracted the records using the incremental-append option. Both records were extracted.

sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --password cloudera \
  --table customer \
  --target-dir '/user/cloudera/kbk/sqoop1' \
  --incremental append \
  --check-column id \
  --last-value 0 \
  --m 1

Output of the above script:

[cloudera@quickstart sqoop_scripts]$ hadoop fs -ls /user/cloudera/kbk/sqoop1
Found 1 items
-rw-r--r--   1 cloudera cloudera    34 2017-01-01 05:49 /user/cloudera/kbk/sqoop1/part-m-00000
[cloudera@quickstart sqoop_scripts]$ hadoop fs -cat /user/cloudera/kbk/sqoop1/part-m-00000
1,abc,2016-01-01
2,def,2016-01-02
[cloudera@quickstart sqoop_scripts]$
Then I updated one record:

update customer set name = 'abcxyz', recorddt = '2016-01-04' where id = 1;

mysql> select * from customer;
+------+--------+------------+
| id   | name   | recorddt   |
+------+--------+------------+
|    1 | abcxyz | 2016-01-04 |
|    2 | def    | 2016-01-02 |
+------+--------+------------+
After this, I extracted the updated record with the incremental-lastmodified option, using the script below.

sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --password cloudera \
  --table customer \
  --target-dir '/user/cloudera/kbk/sqoop2' \
  --incremental lastmodified \
  --check-column recorddt \
  --last-value '2016-01-03' \
  --merge-key id \
  --m 1
In the above script, if I use the same target-dir as in the previous script, Sqoop reports that the location already exists, then runs a merge job and generates a part-r-00000 file containing all the records. So I changed the target-dir to a new path and executed it again.
17/01/01 05:54:48 INFO mapreduce.ImportJobBase: Transferred 20 bytes in 31.8766 seconds (0.6274 bytes/sec)
17/01/01 05:54:48 INFO mapreduce.ImportJobBase: Retrieved 1 records.
17/01/01 05:54:48 INFO tool.ImportTool: Final destination exists, will run merge job.
17/01/01 05:54:48 INFO tool.ImportTool: Moving data from temporary directory _sqoop/be32a583f44e47ebbde872e58513d40f_customer to final destination /user/cloudera/kbk/sqoop2
17/01/01 05:54:48 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
17/01/01 05:54:48 INFO tool.ImportTool:  --incremental lastmodified
17/01/01 05:54:48 INFO tool.ImportTool:   --check-column recorddt
17/01/01 05:54:48 INFO tool.ImportTool:   --last-value 2017-01-01 05:54:16.0
17/01/01 05:54:48 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

Output of the above script:

[cloudera@quickstart sqoop_scripts]$ hadoop fs -ls /user/cloudera/kbk/sqoop2
Found 2 items
-rw-r--r--   1 cloudera cloudera     0 2017-01-01 05:54 /user/cloudera/kbk/sqoop2/_SUCCESS
-rw-r--r--   1 cloudera cloudera    20 2017-01-01 05:54 /user/cloudera/kbk/sqoop2/part-m-00000
[cloudera@quickstart sqoop_scripts]$ hadoop fs -cat /user/cloudera/kbk/sqoop2/part-m-00000
1,abcxyz,2016-01-04
[cloudera@quickstart sqoop_scripts]$
I understand the incremental option, but I can't quite see how the merge works. I am clearly missing something. What am I missing about how the merge works (the record with id 1 was updated)?
Hi @Bharath Kumar K, with sqoop merge you merge two datasets. You specify a --merge-key (usually the primary key), and Sqoop runs a map-reduce job that keeps only the most recently updated row for each --merge-key value.
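To make the merge semantics concrete, here is a plain-shell sketch (not Sqoop itself) that replays this thread's two datasets as local files and keeps the row from the newer dataset whenever an id appears in both — which is effectively what the merge map-reduce job does with --merge-key id. The local file names are illustrative stand-ins for the HDFS part files:

```shell
# Stand-ins for /user/cloudera/kbk/sqoop1 (older) and .../sqoop2 (newer).
printf '1,abc,2016-01-01\n2,def,2016-01-02\n' > old_part-m-00000
printf '1,abcxyz,2016-01-04\n' > new_part-m-00000

# Key each row by column 1 (the id). Rows read later (from the newer
# file) overwrite rows with the same key from the older file.
awk -F',' '{ row[$1] = $0 } END { for (k in row) print row[k] }' \
    old_part-m-00000 new_part-m-00000 | sort > merged.txt

cat merged.txt
# 1,abcxyz,2016-01-04
# 2,def,2016-01-02
```

The same two-step shape applies to Sqoop: the mappers tag each record with its dataset, and the reducer emits one record per merge-key value, preferring the newer dataset.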
Hey Ajay, I have 2 questions below:
1. Without the merge-key option, what happens?
2. If I use the previous target location, will it auto-merge even though it errors out saying the directory exists?
You are using the merge functionality incorrectly! Your second Sqoop job is an incremental import, not a merge.
Now since you have two imports, let's talk about merge functionality.
Your older dataset - /user/cloudera/kbk/sqoop1
Your newer dataset - /user/cloudera/kbk/sqoop2
What should the Sqoop job for merging them together look like?

sqoop merge \
  --new-data /user/cloudera/kbk/sqoop2 \
  --onto /user/cloudera/kbk/sqoop1 \
  --target-dir /user/cloudera/kbk/sqoop_merged \
  --jar-file datatypes.jar \
  --class-name Foo \
  --merge-key id
Here, --jar-file is the jar generated in your "outdir" when you did the import, and --class-name is the record class inside that jar. If they are no longer available, you can recreate the class using the sqoop codegen tool.
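For reference, a codegen invocation might look like the following. This is a sketch only: it needs the live MySQL database and a Sqoop client, and the --outdir/--bindir paths are placeholders I chose for illustration:

```shell
# Regenerate the record class and jar for the customer table,
# using the same connection details as the original import.
sqoop codegen \
  --connect jdbc:mysql://localhost/training_db \
  --username root \
  --password cloudera \
  --table customer \
  --outdir /tmp/sqoop_codegen/src \
  --bindir /tmp/sqoop_codegen/bin
# The generated jar lands under --bindir and the class is named after
# the table ("customer"); pass those to sqoop merge as --jar-file and
# --class-name.
```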
When you run the aforementioned Sqoop job, your older and newer datasets will be merged together based on your id column.
Refer to the Sqoop documentation for further details, and let me know if you need any other help!
I am not able to understand the use of sqoop merge. All I can see is that you compare the old dataset with the new one and replace the old records with the new ones. My question is: if it's just about updating the old data with the new, why can't we simply import the full table? Why use sqoop merge?
The Sqoop job runs OK, but I'm facing an issue at the end:

Logging initialized using configuration in jar:file:/usr/hdp/220.127.116.11-91/hive/lib/hive-common-1.2.1000.2.6.4.0-91.jar!/hive-log4j.properties
OK
Time taken: 2.92 seconds
FAILED: SemanticException Line 2:17 Invalid path ''hdfs://hostname/user/xyz/_sqoop/55cc1038f2924cc398e5e014061eb0f2_sample_table'': No files matching path hdfs://hostname/user/xyz/_sqoop/55cc1038f2924cc398e5e014061eb0f2_sample_table
Here xyz is the unix user running the Sqoop operation. The target dir is different, and I can see data getting loaded into the designated target-dir. Has anyone encountered this?