Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Sqoop Hive-import not deleting old data in warehouse

avatar
Explorer

Hello,

 

when importing data from DB2 via Sqoop to Hive the stored data in /warehouse/tablespace/managed/hive/databasename/tablename/ is steadily growing.

For every import (with --hive-import and --hive-overwrite set) there is a new folder: 

"base_000000n" created. Thus the parent folder is steadily growing. Any way to delete the old folders before importing new data with Sqoop?
 
regards
1 ACCEPTED SOLUTION

avatar
Super Guru

@Muffex    My recommendation for ingestion process is to always use staging/temporary tables which are managed separately from the master table the data needs to arrive in.   This allows you operate on the staging tables before or after those results are added to the master table w/o effecting the master table.    In your use case, your ingestion process would sqoop to temp,  insert from temp to master table, then drop temp location.    In some of my past implementations of this manner, the temp tables were organized hourly, and they stay active for at least 7 days before a decoupled cleanup job removes anything 7 days old.   This idea was done for auditing purposes, but normally I would create and destroy the data during the ingestion procedure.

 

 

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic please comment here or feel free to private message me. If you have new questions related to your Use Case please create separate topic and feel free to tag me in your post.  

 

Thanks,


Steven @ DFHZ

View solution in original post

2 REPLIES 2

avatar
Super Guru

@Muffex    My recommendation for ingestion process is to always use staging/temporary tables which are managed separately from the master table the data needs to arrive in.   This allows you operate on the staging tables before or after those results are added to the master table w/o effecting the master table.    In your use case, your ingestion process would sqoop to temp,  insert from temp to master table, then drop temp location.    In some of my past implementations of this manner, the temp tables were organized hourly, and they stay active for at least 7 days before a decoupled cleanup job removes anything 7 days old.   This idea was done for auditing purposes, but normally I would create and destroy the data during the ingestion procedure.

 

 

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic please comment here or feel free to private message me. If you have new questions related to your Use Case please create separate topic and feel free to tag me in your post.  

 

Thanks,


Steven @ DFHZ

avatar
Explorer

@stevenmatison Thanks for your answer. As my tables are relatively small and only used to duplicate existing data - is there any way to remove the existing folders before importing new data?
regards