Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Incremental sqoop import into hive with tables having created date and modified date

avatar
Explorer

Hi All,

 

I am trying to achieve incremental sqoop import into hive when a new record is created as well when updates happen to the existing records. 

We have tables having both created date column and modified date column. But the modified date column will be updated only when there is any update in the existing records and will remain null when a new record is inserted. Only the created date will be updated when new records are created. 

 

So, when we try the incremental lastmodified option it always looks for the modified date column but our source tables doesnt have values in the modified date column always as it keeps NULL for new records and only updates the value when an existing table gets modified. 

 

So, for this kind of source DB tables with such columns is there any solution in sqoop import. 

 

Also, sqoop incremental with lastmodifed doesnt work with hive tables and we have to create a table already and then run this incremental lastmodified

 

"INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
--incremental lastmodified option for hive imports is not supported. Please remove the parameter --incremental lastmodified."

 

So, please help on which approach will suit better to do an incremental import of our data into Hive tables. 


Need a solution badly for this scenario. please help if you have done this before.

1 ACCEPTED SOLUTION

avatar
Master Collaborator

Hi @sass,

 

To use sqoop, you will need to combine the two columns, your CREATED_ON column and UPDATED_ON column. Unfortunately, sqoop is only able to look at one timestamp column as the control for incremental load. 

 

As for the error you are getting, that is expected due to SQOOP-2986. The alternative strategy is to do a regular sqoop import (without hive-import) and provide --target-dir to scoop that coincides with the Hive table location. 

 

 

Hope this helps,

Alex

View solution in original post

4 REPLIES 4

avatar
Master Collaborator

Hi @sass,

 

To use sqoop, you will need to combine the two columns, your CREATED_ON column and UPDATED_ON column. Unfortunately, sqoop is only able to look at one timestamp column as the control for incremental load. 

 

As for the error you are getting, that is expected due to SQOOP-2986. The alternative strategy is to do a regular sqoop import (without hive-import) and provide --target-dir to scoop that coincides with the Hive table location. 

 

 

Hope this helps,

Alex

avatar
Community Manager

@sass Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. 



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Explorer

Yes, I was doing both append and incremental last modified by giving warehouse directory which did not help in merging the records. But when I gave target directory it worked fine. Now, new records get updated on the hive table with incremental append script and updated records are entered into hive using lastmodified with updated_date but with target directory as parameter and without giving hive-import parameter. Thanks for suggesting it. That's what already worked, 

avatar
New Contributor

Is it possible for you to post your sample sqoop job? 

Thanks

@sass