Hive Upserts

Hi All,

I'm trying to upsert JSON data into Hive using this approach: https://goo.gl/J7chi3

With this approach I am running into two issues.

1. My input is 20k records, but the output record count is approx. 23.5k. Some JSON records are breaking and creating duplicates.

2. My input is 10k records, but the output record count is 20k. According to the link, a record should be updated if it is already present in the table.
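
To make the two symptoms easier to check, here is a minimal Python sketch of the kind of diagnostic I have in mind: it counts keys that occur more than once and lines that fail to parse as JSON. The function name `duplicate_keys` and the key field `id` are assumptions for illustration only; my real data uses its own key, and this is not part of the linked approach.

```python
import json
from collections import Counter


def duplicate_keys(json_lines, key="id"):
    """Return ({key_value: count} for keys seen more than once,
    number of lines that are not valid JSON objects).

    `key` is a hypothetical field name; substitute the real record key.
    """
    counts = Counter()
    bad = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad += 1  # truncated or broken JSON
            continue
        if not isinstance(record, dict):
            bad += 1  # valid JSON, but not an object
            continue
        counts[record.get(key)] += 1
    dups = {k: c for k, c in counts.items() if c > 1}
    return dups, bad


if __name__ == "__main__":
    sample = [
        '{"id": 1, "v": "a"}',
        '{"id": 2, "v": "b"}',
        '{"id": 1, "v": "c"}',
        '{"id": 3, "v": "bro',  # a record cut off mid-string
    ]
    print(duplicate_keys(sample))
```

Running something like this on an export of the output table would show whether the extra rows are repeated keys, broken JSON fragments, or both.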

Can anyone guide me on how to do upserts in Hive?

Apart from the approach above, here are a few other methods I tried for upserts that failed:

I tried the MERGE option in Hive (see https://community.hortonworks.com/articles/97113/hive-acid-merge-by-example.html). In my experience this merge is not suitable for merging 5 GB or more: it takes hours to complete, never completes, or fails with a heap memory error, even for 6 GB of data.

Someone suggested using MERGE with both the source and the destination partitioned. However, we get an error when the destination is partitioned, since MERGE cannot update the partition key value.

The cluster RAM size is 250 GB.

Can anyone please help me with definitive steps? The solution should work for upserts (when a record matches, update it; if not, insert it) on larger datasets of more than 5 TB. None of the solutions I have found on the Internet have worked, after more than a month of trying. Could anyone let me know the valid steps for larger datasets with JSON?
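
To be explicit about the behavior I expect, here is a small in-memory Python sketch of the upsert semantics. It only illustrates the desired result, not how to do it in Hive at scale; the function name `upsert` and the key field `id` are hypothetical.

```python
def upsert(target, incoming, key="id"):
    """Return a new list of rows where incoming rows replace target rows
    with the same key, and rows with new keys are appended.

    `key` is a hypothetical field name; substitute the real record key.
    """
    merged = {row[key]: row for row in target}
    for row in incoming:
        # Matched key: the existing row is replaced (update).
        # Unmatched key: a new entry is added (insert).
        merged[row[key]] = row
    return list(merged.values())


if __name__ == "__main__":
    table = [{"id": 1, "v": "old"}, {"id": 2, "v": "keep"}]
    batch = [{"id": 1, "v": "new"}, {"id": 3, "v": "add"}]
    for row in upsert(table, batch):
        print(row)
```

With this behavior, loading a batch whose keys all already exist in the table should leave the row count unchanged, which is not what I am seeing.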