Hive incremental approach
Labels: Apache HBase, Apache Hive
Created on 07-02-2017 06:42 PM - edited 09-16-2022 04:52 AM
I would like suggestions on an incremental load strategy for our tables.
We have a set of source tables that are refreshed daily at the source (DB2),
and we need to refresh them in the Hive database as well. Which approach would you suggest?
The source tables receive new inserts as well as updates to existing records.
1) Approach 1: Use HBase to store the data, since updates are allowed, and build a Hive external table on top of it. My concern is whether this will slow down queries that join the Hive-on-HBase table with large ORC Hive tables.
2) Approach 2: Use the four-step incremental update approach suggested by Hortonworks (HDP):
https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
Created 07-02-2017 07:33 PM
Consider a few more points before choosing one of the approaches:
1. Number of records: approach 1 is fine for a very large number of records, and approach 2 is OK for a smaller number of records.
2. How will you handle it if something goes wrong? The fourth step in approach 2 drops the base table and recreates it with the new data. Suppose you notice a data issue a couple of days later: how do you get the dropped base_table back? If you have an answer to that, go for approach 2.
3. Approach 3: You are choosing approach 1 because HBase supports updates and Hive does not (I guess this is your understanding). That was correct for old Hive versions, but UPDATE has been available since Hive 0.14 (for tables that support ACID transactions):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
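To make approach 2 a bit more concrete: as I understand it, the reconcile step of the four-step strategy amounts to keeping the most recent version of each key across the base rows and the incremental rows. Below is a rough Python sketch of that logic only, not the actual Hive implementation. The column names `id` and `modified_date` and the function name `reconcile` are assumptions for illustration; your tables will have their own key and timestamp columns.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def reconcile(base: Iterable[Row], incremental: Iterable[Row],
              key: str = "id", version: str = "modified_date") -> List[Row]:
    """Keep the newest version of each key across base and incremental rows.

    `key` and `version` are hypothetical column names for this sketch.
    """
    latest: Dict[Any, Row] = {}
    for row in list(base) + list(incremental):
        current = latest.get(row[key])
        # >= means that on a tie the later-listed (incremental) row wins.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Yesterday's base table (illustrative data only).
base = [
    {"id": 1, "name": "alice", "modified_date": "2017-06-30"},
    {"id": 2, "name": "bob", "modified_date": "2017-06-30"},
]
# Today's extract from the source: one update (id 2) and one insert (id 3).
incremental = [
    {"id": 2, "name": "robert", "modified_date": "2017-07-01"},
    {"id": 3, "name": "carol", "modified_date": "2017-07-01"},
]

merged = reconcile(base, incremental)
for row in merged:
    print(row)
```

Note that the sketch builds a new result and leaves `base` untouched. That is the same idea behind point 2 above: if you keep a copy of the old base data around before replacing it, you still have something to go back to when a problem shows up a few days later.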