Support Questions


sqoop import --incremental lastmodified is deleting old data

New Contributor

Mode: lastmodified

mysql> describe orders;
+-------------------+-------------+------+-----+---------+----------------+
| Field             | Type        | Null | Key | Default | Extra          |
+-------------------+-------------+------+-----+---------+----------------+
| order_id          | int(11)     | NO   | PRI | NULL    | auto_increment |
| order_date        | datetime    | NO   |     | NULL    |                |
| order_customer_id | int(11)     | NO   |     | NULL    |                |
| order_status      | varchar(45) | NO   |     | NULL    |                |
+-------------------+-------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)

Import the orders table into HDFS

sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id  \
--target-dir /user/sqoop/orders \
--as-textfile

After import

[cloudera@quickstart lib]$ hadoop fs -ls -R /user/sqoop/orders
-rw-r--r--   1 cloudera supergroup       0 2017-12-02 16:01 /user/sqoop/orders/_SUCCESS
-rw-r--r--   1 cloudera supergroup  741597 2017-12-02 16:01 /user/sqoop/orders/part-m-00000
-rw-r--r--   1 cloudera supergroup  753022 2017-12-02 16:01 /user/sqoop/orders/part-m-00001
-rw-r--r--   1 cloudera supergroup  752368 2017-12-02 16:01 /user/sqoop/orders/part-m-00002
-rw-r--r--   1 cloudera supergroup  752940 2017-12-02 16:01 /user/sqoop/orders/part-m-00003

Update order data

mysql> select * from orders where order_id=10;
+----------+---------------------+-------------------+-----------------+
| order_id | order_date          | order_customer_id | order_status    |
+----------+---------------------+-------------------+-----------------+
|       10 | 2013-07-25 00:00:00 |              5648 | PENDING_PAYMENT |
+----------+---------------------+-------------------+-----------------+
1 row in set (0.00 sec)

mysql> update orders set order_status='CLOSED', order_date=now() where order_id=10;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from orders where order_id=10;
+----------+---------------------+-------------------+--------------+
| order_id | order_date          | order_customer_id | order_status |
+----------+---------------------+-------------------+--------------+
|       10 | 2017-12-02 16:19:23 |              5648 | CLOSED       |
+----------+---------------------+-------------------+--------------+
1 row in set (0.00 sec)

Import additional data

sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id \
--check-column order_date \
--merge-key order_id \
--incremental lastmodified \
--target-dir /user/sqoop/orders \
--as-textfile

Output:

[cloudera@quickstart lib]$ hadoop fs -ls -R /user/sqoop/orders
-rw-r--r--   1 cloudera cloudera        0 2017-12-02 16:07 /user/sqoop/orders/_SUCCESS
-rw-r--r--   1 cloudera cloudera  2999918 2017-12-02 16:07 /user/sqoop/orders/part-r-00000

Question: The old files in the HDFS directory (/user/sqoop/orders/part-m-00000 through part-m-00003) got deleted. If this is an incremental import, why is Sqoop deleting the old files?

1 ACCEPTED SOLUTION

Master Guru
@VINAYAK DORNALA

In your sqoop import you are using

--merge-key order_id

Whenever you use the merge-key argument, the sqoop import runs a MapReduce job that takes two datasets as input: a newer one and an older one. The output of that MapReduce job is placed in the HDFS directory specified by --target-dir.

Sqoop compares the new data with the old existing data (part-m-00000 through part-m-00003); if a record with the same order_id is present in both, Sqoop merges the old record with the newly imported one and creates a new part file in the target directory.

You can see that a part-r-00000 file was created after the sqoop import with the merge key; the part-r prefix means a reducer created the file.

When we run a sqoop import without the merge-key argument, the files in the target directory are named part-m, meaning a mapper created them.

So this is expected behavior: when the merge-key argument is specified, Sqoop compares the existing data with the new data, merges records that share the same order_id in the reducer phase, and writes a new part-r file, removing all the existing files in the process (Sqoop has already folded every order_id from the old part-m files into the new part-r output).
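
For reference, the same merge step can be reproduced by hand with Sqoop's standalone merge tool, which makes the "two directories in, one merged directory out" behavior explicit. This is only an illustrative sketch: the orders_old, orders_new, and orders_merged directories and the ./orders.jar path are hypothetical, not taken from the run above.

# Generate the record class Sqoop needs to parse the rows (codegen prints the
# actual jar location, typically under /tmp/sqoop-<user>/compile/...):
sqoop codegen \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders

# Merge the older dataset with the newer one on order_id; rows present in both
# keep the version from --new-data, and the result is written to --target-dir:
sqoop merge \
--new-data /user/sqoop/orders_new \
--onto /user/sqoop/orders_old \
--target-dir /user/sqoop/orders_merged \
--jar-file ./orders.jar \
--class-name orders \
--merge-key order_id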


4 REPLIES

New Contributor

@Shu

Hi, I added the merge key to overcome the error below. Also, I can't use --last-value because I don't know the value.

ERROR tool.ImportTool: Error during import: --merge-key or --append is required when using --incremental lastmodified and the output directory exists.
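
For completeness, a minimal sketch of the other option that error message allows, assuming the same connection details as in the original import. With --append, Sqoop satisfies the same check but simply adds new part files instead of merging, so an updated row such as order_id=10 would then exist in both the old and the new files:

sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id \
--check-column order_date \
--incremental lastmodified \
--append \
--target-dir /user/sqoop/orders \
--as-textfile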

Master Guru

@VINAYAK DORNALA

If you don't want to pass the last value on every run, create a sqoop job: a sqoop job stores the last value in its metastore and passes it automatically the next time the job runs.

--incremental lastmodified:-
Creating sqoop job:-

sqoop job --create test -- import \
--connect 'jdbc:mysql://quickstart:3306/retail_db' \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id \
--target-dir /user/sqoop/orders \
--check-column order_date \
--merge-key order_id \
--incremental lastmodified \
--as-textfile

sqoop job --list        // list the created sqoop jobs
sqoop job --exec test   // execute the sqoop job
sqoop job --delete test // delete the sqoop job
  • lastmodified mode works with a merge key: it compares the existing old data in the directory with the new data, and if the same merge key appears in the old data, Sqoop merges that old record with the new one in the reducer phase and writes the merged result to the target directory.
  • When we create a sqoop job, the import runs incrementally: the first run imports all the data and stores the last value in the metastore, and subsequent runs use that stored last value to import only the newly added or modified rows (you can inspect the stored value as shown in the sketch after this list).
  • lastmodified with the merge-key argument works with an existing target directory.
  • If we don't specify the merge-key argument, Sqoop throws an error when the target directory already exists.
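
To verify that the metastore is tracking the boundary, you can print the saved job definition; after a run it includes the stored check-column value. The property name in the comment is assumed typical output, not copied from this job:

sqoop job --show test   # prints the saved job definition; after a run it also lists the
                        # stored boundary (a property along the lines of
                        # incremental.last.value = <timestamp of the last import>)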

--incremental append:-

sqoop job --create test -- import \
--connect 'jdbc:mysql://quickstart:3306/retail_db' \
--username retail_dba \
--password cloudera \
--table orders \
--split-by order_id \
--target-dir /user/sqoop/orders \
--check-column order_date \
--incremental append \
--as-textfile
  • Append mode works with an existing directory (if the directory does not exist, Sqoop creates it) and stores the last state value in the metastore.
  • On the second run the sqoop job imports only the data added after the last state value and writes it to a new file in the target directory; see the sketch after this list.
  • In this mode only mappers run, because we are not specifying any merge key.
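
A minimal sketch of that two-run flow, reusing the job name test from the command above (the comments describe the expected behavior, not captured output):

sqoop job --exec test   # first run: imports everything and saves the boundary value in the metastore
# ... new rows are added to the orders table in MySQL ...
sqoop job --exec test   # later run: imports only rows with order_date past the saved boundary,
                        # writing them as new part-m files alongside the existing ones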

You can decide which mode best fits your use case.

Note:-

  • Make sure there is a "--" followed by a space before "import" when creating the sqoop job.
  • Use an options file to store the credentials (username, password, and connect string) and pass it as a parameter to --options-file; a sketch follows below.
    Creating options file in sqoop
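
A sketch of such an options file (the /home/cloudera/connection.txt name and path are made up for illustration); Sqoop expects each option and each value on its own line, and lines starting with # are treated as comments:

# /home/cloudera/connection.txt
import
--connect
jdbc:mysql://quickstart:3306/retail_db
--username
retail_dba
--password
cloudera

It can then be passed on the command line, with the table-specific arguments supplied as usual:

sqoop --options-file /home/cloudera/connection.txt --table orders --target-dir /user/sqoop/orders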

New Contributor

Hi,

Sqoop is running OK, however I'm facing an issue at the end:

Logging initialized using configuration in jar:file:/usr/hdp/2.6.4.0-91/hive/lib/hive-common-1.2.1000.2.6.4.0-91.jar!/hive-log4j.properties
OK
Time taken: 2.92 seconds
FAILED: SemanticException Line 2:17 Invalid path ''hdfs://hostname/user/xyz/_sqoop/55cc1038f2924cc398e5e014061eb0f2_sample_table'': No files matching path hdfs://hostname/user/xyz/_sqoop/55cc1038f2924cc398e5e014061eb0f2_sample_table

where xyz is the Unix user running the sqoop operation. The target dir is different, and I can see data getting loaded into the designated --target-dir. Has anyone encountered this?

K