Created 07-03-2017 01:36 AM
I would like suggestions on an incremental load strategy for our tables. We have a set of source tables that are refreshed daily in the source (DB2), and we need to refresh them in the Hive database as well. Which approach would you suggest?
Please note: the source tables receive new inserts as well as updates to existing records.
1) Approach 1: Use HBase to store the data, since updates are allowed, and build a Hive external table on top of it. My concern is whether this will hurt queries that join the Hive-on-HBase table with large ORC Hive tables.
2) Approach 2: Use the four-step incremental update approach suggested by HDP?
https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
thanks
Abhijeet
Created 07-03-2017 08:32 PM
Have you considered the SQL MERGE statement? It is designed specifically for this.
Created 07-04-2017 02:47 PM
Created 07-06-2017 01:33 PM
I think there is always interest in your approach of doing real-time inserts/updates/deletes into HBase and then fronting that with a Hive table, but I don't believe you will get the kind of performance you are expecting once you start joining that table with first-class Hive tables, not to mention running any kind of analytical query (OK, any query that doesn't just read by rowKey). I'm not saying it isn't a valid approach, but you would want to do some testing, and even then you might find yourself doing the updates against HBase and periodically dumping that data into something you could use in a more first-class manner with Hive (at which point you have lost your real-time updates).
I do agree with the others who have commented on this question about looking at Hive's INSERT/UPDATE/DELETE options as well as the newly supported MERGE command. Plenty of testing will be needed to confirm this is your solution, but it is clearly the most developer-friendly model to chase. Significant effort has gone into getting this working so far, and I expect that work to continue to broaden the scope and reduce the prerequisites.
Regarding Approach 2 and the incremental update blog post from 2014, I invite you to take a look at the materials from my 2015 Summit talk on this topic, https://martin.atlassian.net/wiki/x/GYBzAg. I think there are a few options to consider if you go down this "classical" data update path. Which one fits depends mostly on the size of the table, the percentage of data being changed, how skewed those updates are, and how frequently you need to sync with the source table.
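For anyone new to the "classical" path, the core of it is a reconcile step: combine the existing base rows with the newly ingested incremental rows and keep only the most recent version of each record. Below is a rough sketch of that logic in plain Python, purely to illustrate the idea; it is not Hive code. The function name `reconcile` and the column names `id` and `modified_date` are assumptions made for this example, so substitute whatever key and change-timestamp columns your DB2 tables actually carry.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def reconcile(
    base: Iterable[Row],
    incremental: Iterable[Row],
    key: str = "id",                  # hypothetical primary-key column
    modified: str = "modified_date",  # hypothetical last-modified column
) -> List[Row]:
    """Return one row per key, keeping the most recently modified version."""
    latest: Dict[Any, Row] = {}
    # Base rows are visited first, so on a timestamp tie the incremental row wins.
    for row in list(base) + list(incremental):
        current = latest.get(row[key])
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Sample data: ISO date strings compare correctly as plain strings.
base_rows = [
    {"id": 1, "name": "alpha", "modified_date": "2017-07-01"},
    {"id": 2, "name": "beta", "modified_date": "2017-07-01"},
]
incremental_rows = [
    {"id": 2, "name": "beta-updated", "modified_date": "2017-07-02"},  # update
    {"id": 3, "name": "gamma", "modified_date": "2017-07-02"},         # new insert
]

reconciled = reconcile(base_rows, incremental_rows)
for row in reconciled:
    print(row)
```

Note that this only handles inserts and updates, which matches the situation described in the question. Deletes in the source would need a separate signal, such as a delete flag column, and the cost of rebuilding the full table each cycle is exactly why table size and the percentage of changed rows matter when choosing a variant.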
Good luck and happy Hadooping!