Support Questions
Find answers, ask questions, and share your expertise

Duplicate data in incremental load from sql to hive


Hi all,

When I try an incremental load from SQL Server to Hive, I get duplicate data. I have only 2 rows in SQL Server, but if I run the command again I get another 2 rows in Hive, i.e. duplicates appear.

Please suggest a solution.

sqoop import \
--connect 'jdbc:sqlserver://XXX.XXX;database=PTWTarget' \
--username un \
--password Admin \
--table ActionItems \
--incremental lastmodified \
--check-column Lastupdateddate \
--merge-key ActionId \
--num-mappers 1 \
--hive-import \
--hive-table project_tracking.actionitems2


New Contributor

Hi @Nethaji R,

In your sqoop command, the --last-value argument is missing. Add it and you will not get duplicate data in your Hive table. For incremental imports it is also better to use a saved sqoop job, so that you don't have to remember the last value every time.

Here is the code I used to solve your problem:

sqoop import \
--connect 'connectionUrl' \
--username 'username' \
--password 'password' \
--table 'tableName' \
--incremental lastmodified \
--check-column 'columnName' \
--last-value "columnValue" \
--merge-key "mergeKeyColumnName" \
--hive-import \
--hive-table "hive-database.hive-table"

Let me know if it's not clear.
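For reference, the saved sqoop job mentioned above records --last-value in Sqoop's metastore and advances it automatically after each run. A minimal sketch, assuming a SQL Server source like the one in this thread; the job name, connection URL, credentials, and paths are placeholders, not values from the original post:

```shell
# Create a saved job; Sqoop stores the incremental last-value
# and updates it after every successful --exec.
sqoop job --create actionitems_incr -- import \
  --connect 'jdbc:sqlserver://XXX.XXX;database=PTWTarget' \
  --username un \
  --password-file /user/sqoop/pwd.file \
  --table ActionItems \
  --incremental lastmodified \
  --check-column Lastupdateddate \
  --merge-key ActionId \
  --target-dir /user/hive/warehouse/project_tracking.db/actionitems2

# Run the import; repeated runs pick up only new or changed rows.
sqoop job --exec actionitems_incr

# Inspect the stored parameters, including incremental.last.value.
sqoop job --show actionitems_incr
```

Note that with --incremental lastmodified and --merge-key, Sqoop merges into --target-dir rather than importing straight into Hive, so the Hive table would typically sit on top of that directory; that is an assumption about the setup, not something stated in the answer above.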






I found the solution; no duplicate data occurs now.

Thank you

sqoop import \
--connect 'jdbc:sqlserver://XX.XXX;database=XXXX' \
--username XX \
--password XXXX \
--table ActionItems \
--split-by ActionId \
--target-dir /user/hive/warehouse/project_tracking.db/actionitems \
--incremental append \
--check-column Lastupdateddate \
--last-value "2019-02-06 19:52:55.873"
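To double-check that repeated runs really leave no duplicates, a key-count query in Hive works; the beeline URL is a placeholder, while the table and key names are taken from this thread:

```shell
# Any ActionId returned here appears more than once in the Hive table;
# an empty result set means no duplicates.
beeline -u 'jdbc:hive2://localhost:10000' -e "
  SELECT ActionId, COUNT(*) AS cnt
  FROM project_tracking.actionitems
  GROUP BY ActionId
  HAVING COUNT(*) > 1;"
```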