I'm importing a table from DB2 into HCatalog with the lastmodified option, stored in ORC format. Sometimes I get duplicate records: some tables import properly, but others end up with duplicates. What might be the problem?
Hi @Ravikiran Dasari!
Could you share your sqoop call?
By the way, I'm not sure this is your case, but I once had a similar problem and solved it by passing the correct timestamp to --last-value.
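For reference, a one-off run with an explicit --last-value could look roughly like the sketch below. The options file path, the table, the timestamp column and the HCatalog names are all placeholders I made up, and import_since is just a hypothetical wrapper so the timestamp can be passed as an argument:

```shell
# Sketch only: the options file, table, column and HCatalog names are placeholders.
# OPTIONS_FILE is assumed to contain the connection settings (--connect, credentials, ...).
OPTIONS_FILE="${OPTIONS_FILE:-/path/to/connection-options.txt}"

import_since() {
  # $1 = lower bound for the check column, e.g. "2018-07-29 15:10:08.0"
  sqoop import \
    --options-file "$OPTIONS_FILE" \
    --table MY_SCHEMA.MY_TABLE \
    --incremental lastmodified \
    --check-column MY_TIMESTAMP_COL \
    --last-value "$1" \
    -m 1 \
    --hcatalog-database my_db \
    --hcatalog-table my_table
}
```

Calling import_since "2018-07-29 15:10:08.0" would then run the import with that timestamp as the lower bound, instead of relying on whatever value was stored earlier.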
Also, could you confirm whether these "duplicate records" appear within a particular window of timestamps, or only with a particular number of mappers? For example:
"It only happens if I set more than 4 (default) mappers, and the timestamps of the duplicated records are close to the value of --last-value."
Thanks for the response.
The last-modified value (--last-value) is taken from the Sqoop job itself; if I supply it manually there is no issue. In my source DB, new records were added at 2018-07-29 01:20:08 and my Sqoop import ran at 2018-07-29 15:10:08.234980. The next source load will be at 2018-07-30 01:30:08 and my Sqoop import will run at 2018-07-30 14:10:08.234980. That run imports the 2018-07-29 records as well as the 2018-07-30 records. It does not happen every time; sometimes it imports only the 2018-07-30 records. My import statement is as follows:
sqoop job --create PACKAGE_EVENT_AUTOBOOST_SETUP_AMOUNT_JOB -- import \
  --options-file '/home/hdfs/sqoopimport/DBConnections/connectionDetails.txt' \
  --password-file 'hdfs://ssehdp101.metmom.mmih.biz:8020/passwd/psw.txt' \
  --table REPORT.PACKAGE_EVENT_AUTOBOOST \
  --incremental lastmodified \
  --check-column LOAD_AT \
  -m 1 \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database SNDPD \
  --hcatalog-table report_PACKAGE \
  --hcatalog-storage-stanza 'stored as orcfile'
sqoop job --exec PACKAGE_EVENT_AUTOBOOST_SETUP_AMOUNT_JOB
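Since the last value comes from the saved job, it may help to check what the job has actually stored before each run. A minimal sketch, assuming sqoop job --show prints the job's properties with incremental.* entries among them (the exact output layout may differ between Sqoop versions); show_incremental_state is a hypothetical helper name:

```shell
# Print only the incremental-import properties stored in a saved Sqoop job.
# Assumption: `sqoop job --show` lists them as lines such as
# "incremental.last.value = ..."; adjust the filter if your version differs.
show_incremental_state() {
  # $1 = name of the saved job
  sqoop job --show "$1" | grep -i 'incremental'
}
```

For example, show_incremental_state PACKAGE_EVENT_AUTOBOOST_SETUP_AMOUNT_JOB run before and after each --exec would show whether the stored value really advances to the time of the previous import.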
Hello @Ravikiran Dasari!
Okay. Are these rows being updated in the --check-column LOAD_AT? If so, they will be imported again only if the new value is greater than --last-value (or the value saved in the Sqoop job); otherwise only new rows should be imported.
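To make that rule concrete with the timestamps from your post, here is a small sketch. It only mimics the comparison described above (strictly greater than the last value) and does not call Sqoop; the real boundary handling inside Sqoop may differ, and in_window is a hypothetical helper:

```shell
# in_window TS LAST_VALUE: succeeds if TS is strictly later than LAST_VALUE.
# Both values must use the same "YYYY-MM-DD HH:MM:SS[.fraction]" layout,
# so that plain string comparison matches chronological order.
in_window() {
  [ "$1" \> "$2" ]
}

# Last value as it should be saved after the 2018-07-29 run.
last_value="2018-07-29 15:10:08.234980"

for ts in "2018-07-29 01:20:08" "2018-07-30 01:30:08"; do
  if in_window "$ts" "$last_value"; then
    echo "$ts: imported"
  else
    echo "$ts: skipped"
  fi
done
```

With that last value, the 2018-07-29 01:20:08 rows are skipped and only the 2018-07-30 01:30:08 rows are imported. So if the 2018-07-29 rows show up again, one possibility worth checking is that the job was using an older last value than expected.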
One thing you could look into is merging your datasets, keeping only the most recent record for each PK :)
Hope this helps!
There are no updates; I am getting full-row duplicates. I think it is a problem with the Sqoop tool. Also, I can't use merge because the table has no PK.
Sqoop import with lastmodified is not giving consistent results. Sometimes it imports the whole day's records instead of only the records since the last import.
Anyway, thanks a lot.