Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Best Approach to Update Hive Table


Best Approach to Update Hive Table

Rising Star

It's well known that appending (inserting) to Hive is fairly straightforward and simple but what about the case where you need to update? Example below:

customer_id, payment_id, payment_amt, status

#1 c123, p444, 20.00, send_money

#2 c123, p444, 20.00, cancel

In a typical Hive Streaming case you could just append the rows and reconcile them later using another process, but what if you would like to overwrite (update or delete and then insert) record #2 as it cancels the first record. In the past I would track both versions and group by primary key and take the max(upd_ts) or latest version, but that pushes the logic/processing to the read side. Is there any need for a staging table anymore where daily writes go to staging and multiple times a day a second process reconciles, creates a new partition, and inserts the data? I'm hoping with ACID there is a more elegant way to basically perform an "upsert" based on primary key. (This is ORC format)


Re: Best Approach to Update Hive Table

Super Guru

Alan gates presented awesome deck on hive acid and the how to's

Don't have an account?
Coming from Hortonworks? Activate your account here