Created 12-03-2017 01:14 AM
Mode: lastmodified
mysql> describe orders;
+--------------------+-------------+------+-----+---------+----------------+
| Field              | Type        | Null | Key | Default | Extra          |
+--------------------+-------------+------+-----+---------+----------------+
| order_id           | int(11)     | NO   | PRI | NULL    | auto_increment |
| order_date         | datetime    | NO   |     | NULL    |                |
| order_customer_id  | int(11)     | NO   |     | NULL    |                |
| order_status       | varchar(45) | NO   |     | NULL    |                |
+--------------------+-------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)
Import the orders table into HDFS
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id \
--target-dir /user/sqoop/orders \
--as-textfile
After import
[cloudera@quickstart lib]$ hadoop fs -ls -R /user/sqoop/orders
-rw-r--r--   1 cloudera supergroup          0 2017-12-02 16:01 /user/sqoop/orders/_SUCCESS
-rw-r--r--   1 cloudera supergroup     741597 2017-12-02 16:01 /user/sqoop/orders/part-m-00000
-rw-r--r--   1 cloudera supergroup     753022 2017-12-02 16:01 /user/sqoop/orders/part-m-00001
-rw-r--r--   1 cloudera supergroup     752368 2017-12-02 16:01 /user/sqoop/orders/part-m-00002
-rw-r--r--   1 cloudera supergroup     752940 2017-12-02 16:01 /user/sqoop/orders/part-m-00003
Update order data
mysql> select * from orders where order_id=10;
+----------+---------------------+-------------------+-----------------+
| order_id | order_date          | order_customer_id | order_status    |
+----------+---------------------+-------------------+-----------------+
|       10 | 2013-07-25 00:00:00 |              5648 | PENDING_PAYMENT |
+----------+---------------------+-------------------+-----------------+
1 row in set (0.00 sec)

mysql> update orders set order_status='CLOSED', order_date=now() where order_id=10;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from orders where order_id=10;
+----------+---------------------+-------------------+--------------+
| order_id | order_date          | order_customer_id | order_status |
+----------+---------------------+-------------------+--------------+
|       10 | 2017-12-02 16:19:23 |              5648 | CLOSED       |
+----------+---------------------+-------------------+--------------+
1 row in set (0.00 sec)
Import additional data
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id \
--check-column order_date \
--merge-key order_id \
--incremental lastmodified \
--target-dir /user/sqoop/orders \
--as-textfile
Output:
[cloudera@quickstart lib]$ hadoop fs -ls -R /user/sqoop/orders
-rw-r--r--   1 cloudera cloudera          0 2017-12-02 16:07 /user/sqoop/orders/_SUCCESS
-rw-r--r--   1 cloudera cloudera    2999918 2017-12-02 16:07 /user/sqoop/orders/part-r-00000
Question: The old files in the HDFS directory (/user/sqoop/orders/part-m-00000 to part-m-00003) got deleted. If this is an incremental import, why is Sqoop deleting the old files?
Created 12-03-2017 02:13 AM
In your sqoop import you are using

--merge-key order_id

Whenever you use the merge-key argument, the sqoop import runs a MapReduce job that takes two directories as input: a newer dataset and an older one. The output of the MapReduce job is placed in the HDFS directory specified by --target-dir.

Sqoop compares the new data with the old existing data (the existing part-m files). If a record with the same order_id is present, Sqoop merges it with the newly imported data and creates a new part file in the target directory.

You can see that the part-r-00000 file got created after the sqoop import with the merge key; the part-r prefix means a reducer created the file.

When we do a sqoop import without the merge-key argument, the files in the target dir are named part-m, meaning a mapper created them.

This is expected behavior from Sqoop: when the merge-key argument is specified, it compares the existing data with the new data, and if the same order_id is found it merges the old record with the new data in the reducer phase and creates a new part-r file, removing all the existing part-m files in the process.
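To make the merge step more concrete, the lastmodified import with --merge-key behaves roughly like running Sqoop's standalone merge tool over the old and new datasets. The sketch below is only illustrative, not what was run in this thread: the /user/sqoop/orders_new and /user/sqoop/orders_old directories, the orders.jar file and the orders class name are assumed placeholders (in an incremental import Sqoop generates and wires these up for you behind the scenes).

# illustrative only: directories, jar file and class name below are assumed placeholders
sqoop merge \
--new-data /user/sqoop/orders_new \
--onto /user/sqoop/orders_old \
--target-dir /user/sqoop/orders \
--jar-file orders.jar \
--class-name orders \
--merge-key order_id
# records sharing the same order_id keep the newest copy; the reducer output
# replaces the old part-m files with part-r files under --target-dir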
Created 12-03-2017 04:01 AM
Hi, I added the merge-key condition to overcome the error below. Also, I can't use --last-value as I don't know the value.
ERROR tool.ImportTool: Error during import: --merge-key or --append is required when using --incremental lastmodified and the output directory exists.
Created 12-03-2017 03:52 PM
If you don't want to pass the last value on every run, then create a sqoop job: the sqoop job will store the last value in its metastore and pass it automatically the next time the job runs.
--incremental lastmodified:-
Creating sqoop job:-
sqoop job --create test \
-- import \
--connect 'jdbc:mysql://quickstart:3306/retail_db' \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id \
--target-dir /user/sqoop/orders \
--check-column order_date \
--merge-key order_id \
--incremental lastmodified \
--as-textfile
sqoop job --list        //list out created sqoop jobs
sqoop job --exec test   //execute the sqoop job
sqoop job --delete test //delete the sqoop job
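If you want to verify what last value the metastore has recorded between runs, you can print the saved job definition (the exact property names shown in the output may differ slightly between Sqoop versions):

sqoop job --show test   //prints the saved job options, including the stored incremental last value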
--incremental append:-
sqoop job --create test \
-- import \
--connect 'jdbc:mysql://quickstart:3306/retail_db' \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id \
--target-dir /user/sqoop/orders \
--check-column order_date \
--incremental append \
--as-textfile
Note:-
Created 04-30-2018 06:29 PM
Hi,
The sqoop job is running OK; however, I am facing an issue at the end:
Logging initialized using configuration in jar:file:/usr/hdp/2.6.4.0-91/hive/lib/hive-common-1.2.1000.2.6.4.0-91.jar!/hive-log4j.properties
OK
Time taken: 2.92 seconds
FAILED: SemanticException Line 2:17 Invalid path ''hdfs://hostname/user/xyz/_sqoop/55cc1038f2924cc398e5e014061eb0f2_sample_table'': No files matching path hdfs://hostname/user/xyz/_sqoop/55cc1038f2924cc398e5e014061eb0f2_sample_table
where xyz is the unix user who is running the sqoop operation. The target dir is different, and I can see data getting loaded into the designated target-dir. Has anyone encountered this?
K