Support Questions
Find answers, ask questions, and share your expertise

Best Approach to Update Hive Table

Best Approach to Update Hive Table

Rising Star

It's well known that appending (inserting) to Hive is fairly straightforward and simple but what about the case where you need to update? Example below:

customer_id, payment_id, payment_amt, status

#1 c123, p444, 20.00, send_money

#2 c123, p444, 20.00, cancel

In a typical Hive Streaming case you could just append the rows and reconcile them later using another process, but what if you would like to overwrite (update or delete and then insert) record #2 as it cancels the first record. In the past I would track both versions and group by primary key and take the max(upd_ts) or latest version, but that pushes the logic/processing to the read side. Is there any need for a staging table anymore where daily writes go to staging and multiple times a day a second process reconciles, creates a new partition, and inserts the data? I'm hoping with ACID there is a more elegant way to basically perform an "upsert" based on primary key. (This is ORC format)


Re: Best Approach to Update Hive Table

Super Guru

Alan gates presented awesome deck on hive acid and the how to's