Created on 08-27-2020 02:12 AM - edited 08-27-2020 02:13 AM
Hello,
when importing data from DB2 into Hive via Sqoop, the data stored under /warehouse/tablespace/managed/hive/databasename/tablename/ grows steadily. Every import (run with --hive-import and --hive-overwrite set) creates a new folder:
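For context, a minimal sketch of the kind of import described above. The connection string, credentials, and table names here are placeholders, not the actual job:

```shell
# Hypothetical Sqoop import into a managed Hive table; host, credentials,
# and table names are placeholders. --hive-overwrite is meant to replace
# the table's contents on each run.
sqoop import \
  --connect jdbc:db2://db2host:50000/SOURCEDB \
  --username dbuser -P \
  --table SOURCETABLE \
  --hive-import \
  --hive-overwrite \
  --hive-database databasename \
  --hive-table tablename \
  -m 1
```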
Created 08-27-2020 06:18 AM
@Muffex My recommendation for an ingestion process is to always use staging/temporary tables that are managed separately from the master table the data needs to arrive in. This allows you to operate on the staging tables before or after those results are added to the master table, without affecting the master table. In your use case, your ingestion process would sqoop to a temp table, insert from temp to the master table, then drop the temp location. In some of my past implementations, the temp tables were organized hourly and stayed active for at least 7 days before a decoupled cleanup job removed anything older than 7 days. That was done for auditing purposes; normally I would create and destroy the temp data during the ingestion procedure itself.
If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post.
Thanks,
Steven @ DFHZ
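The staging-table flow described above could be sketched like this. All host, database, and table names are placeholders, and the exact HQL depends on your table layout; this is an outline of the idea, not a tested job:

```shell
# 1. Sqoop into a temporary staging table instead of the master table
#    (names are placeholders):
sqoop import \
  --connect jdbc:db2://db2host:50000/SOURCEDB \
  --username dbuser -P \
  --table SOURCETABLE \
  --hive-import \
  --hive-database databasename \
  --hive-table tablename_stg \
  -m 1

# 2. Replace the master table's contents from staging, then drop the
#    staging table so its folder does not accumulate:
beeline -u "jdbc:hive2://hiveserver:10000/databasename" -e "
  INSERT OVERWRITE TABLE tablename SELECT * FROM tablename_stg;
  DROP TABLE tablename_stg;
"
```

Because the master table is only ever written via INSERT OVERWRITE, each run replaces its contents, and any pre- or post-processing happens on the staging table without touching the master.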
Created 09-01-2020 09:16 AM
@stevenmatison Thanks for your answer. Since my tables are relatively small and only used to duplicate existing data, is there any way to remove the existing folders before importing new data?
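One possible approach, assuming the managed table from above (names are placeholders), would be to truncate the table before re-importing, which removes its data files without dropping the table definition:

```shell
# Hypothetical cleanup before a re-import (host and names are placeholders).
# TRUNCATE TABLE deletes the data files of a managed Hive table while
# keeping its schema, so the next import starts from an empty folder:
beeline -u "jdbc:hive2://hiveserver:10000/databasename" \
  -e "TRUNCATE TABLE tablename;"
```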
regards