Created on 07-02-2017 06:42 PM - edited 09-16-2022 04:52 AM
I would like suggestions on which incremental-load strategy to implement for our tables.
We have a set of source tables that are refreshed daily in the source (DB2),
and we need to refresh them in the Hive database as well. Which approach would you suggest?
The source tables receive new inserts as well as updates to existing records.
1) Approach 1: Use HBase to store the data, since updates are allowed, and build a Hive external table on top of it. I am not sure whether this will affect queries that join the Hive-on-HBase table with large ORC Hive tables.
2) Approach 2: Use the four-step incremental update approach suggested by Hortonworks (HDP)?
https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
Created 07-02-2017 07:33 PM
Consider a few more points before choosing one of the approaches:
1. Number of records: approach 1 suits very large tables, while approach 2 is fine for smaller ones.
2. Recovery if something goes wrong: the fourth step in approach 2 drops the base table and recreates it with the new data. Suppose you notice a data issue a couple of days later; how would you get the dropped base_table back? If you have an answer to that, go for approach 2.
3. Approach 3: I guess you are choosing approach 1 because HBase supports updates and Hive does not. That was true of older Hive versions, but UPDATE is available starting with Hive 0.14 (on tables that support ACID transactions):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
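To make the reconcile idea behind approach 2 concrete, here is a minimal in-memory sketch in Python. It is not the HiveQL from the blog post, only an illustration of the rule "for each key, keep the most recently modified row from base plus incremental". The column names `id` and `modified_date`, the sample rows, and the `reconcile` helper are all assumptions made up for this example.

```python
from datetime import date


def reconcile(base_rows, incremental_rows, key="id", modified="modified_date"):
    """Return one row per key, keeping the most recently modified version.

    Hypothetical helper: `key` and `modified` are assumed column names.
    """
    latest = {}
    # Base rows are scanned first, then incremental rows.
    for row in list(base_rows) + list(incremental_rows):
        current = latest.get(row[key])
        # ">=" lets an incremental row win a tie against the base row.
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Made-up sample data: id 2 is updated, id 3 is a new insert.
base = [
    {"id": 1, "name": "alice", "modified_date": date(2017, 6, 30)},
    {"id": 2, "name": "bob", "modified_date": date(2017, 6, 30)},
]
incremental = [
    {"id": 2, "name": "robert", "modified_date": date(2017, 7, 1)},
    {"id": 3, "name": "carol", "modified_date": date(2017, 7, 1)},
]

# The reconciled result is built as a new list; `base` is left untouched,
# so the previous version is still available if a problem shows up later.
reconciled = reconcile(base, incremental)
```

The sketch deliberately builds a new result instead of modifying `base` in place. In Hive terms, that would correspond to keeping a copy of the previous base table before replacing it, which is one possible answer to the recovery question in point 2.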