In Sqoop incremental lastmodified, getting duplicate records?

Hi,

I'm importing tables from DB2 into HCatalog in ORC format using the lastmodified option. Sometimes I get duplicate records: some tables import properly, but others end up with duplicates. What might be the problem?

Thank you.

Re: In Sqoop incremental lastmodified, getting duplicate records?

Hi @Ravikiran Dasari!
Could you share your sqoop call?
By the way, I'm not sure this applies to your case, but I once had a similar problem and solved it by passing the correct timestamp to --last-value.
Also, could you confirm whether these duplicate records appear within a particular window of timestamps, or only with a certain number of mappers?
E.g.
"It only happens if I set more than 4 (the default) mappers, and the timestamps of the duplicated records are close to the value of --last-value."

Re: In Sqoop incremental lastmodified, getting duplicate records?

Hi @Vinicius Higa Murakami,

Thanks for the response.

The --last-value is taken from the saved Sqoop job; if I pass it manually, there is no issue. In my source DB, new records were added at 2018-07-29 01:20:08, and my Sqoop import ran at 2018-07-29 15:10:08.234980. The next load into the source will be at 2018-07-30 01:30:08, and my Sqoop import will run at 2018-07-30 14:10:08.234980. That run imports the 2018-07-29 records as well as the 2018-07-30 records. It does not happen every time; sometimes it imports only the 2018-07-30 records. My import statement is as follows:

sqoop job --create PACKAGE_EVENT_AUTOBOOST_SETUP_AMOUNT_JOB -- import \
  --options-file '/home/hdfs/sqoopimport/DBConnections/connectionDetails.txt' \
  --password-file 'hdfs://ssehdp101.metmom.mmih.biz:8020/passwd/psw.txt' \
  --table REPORT.PACKAGE_EVENT_AUTOBOOST \
  --incremental lastmodified --check-column LOAD_AT -m 1 \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database SNDPD --hcatalog-table report_PACKAGE \
  --hcatalog-storage-stanza 'stored as orcfile'

sqoop job --exec PACKAGE_EVENT_AUTOBOOST_SETUP_AMOUNT_JOB

Re: In Sqoop incremental lastmodified, getting duplicate records?

Hello @Ravikiran Dasari!
Okay. Are these rows being updated in the --check-column LOAD_AT? If so, they will be re-imported only if the new value is greater than the --last-value (or the value saved in the Sqoop job); otherwise only new rows should be imported.
One thing you can take a look at is:

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_merge_literal
You can use it to merge your datasets so that each PK keeps only its most recent record :)
Hope this helps!
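
To see which lower bound the saved job is actually holding before each run, something like the sketch below might help. It assumes that `sqoop job --show` prints the stored value as a line of the form `incremental.last.value = <timestamp>`; the helper name `saved_last_value` is made up for illustration. Comparing that value with the LOAD_AT timestamps of the duplicated rows should show whether the job resumed from an older point than expected.

```shell
# Hypothetical helper: reads `sqoop job --show <job>` output on stdin and
# prints only the stored incremental lower bound.
# Assumes the property is printed as "incremental.last.value = <timestamp>".
saved_last_value() {
  sed -n 's/^[[:space:]]*incremental\.last\.value[[:space:]]*=[[:space:]]*//p'
}

# Usage on a host where sqoop is installed:
#   sqoop job --show PACKAGE_EVENT_AUTOBOOST_SETUP_AMOUNT_JOB | saved_last_value
#
# Arguments given after "--" override the saved ones, so a run can be
# forced to start from an explicit timestamp:
#   sqoop job --exec PACKAGE_EVENT_AUTOBOOST_SETUP_AMOUNT_JOB -- \
#     --last-value '2018-07-29 15:10:08.234980'
```

If the printed value is sometimes a day behind the previous run's timestamp, the saved state is probably not being updated after every execution, which would match the "whole day imported again" symptom.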

Re: In Sqoop incremental lastmodified, getting duplicate records?

Hi @Vinicius Higa Murakami,

There are no updates; I am getting full-row duplicates, so I think it's a problem in the Sqoop tool. Also, I can't make use of merge because the table has no PK.

Sqoop import with lastmodified is not giving consistent results. Sometimes it imports the whole day's records instead of only the records added since the last import.

Anyway, thanks a lot.