Hive Upserts

Hi All,

I'm trying to upsert JSON data into Hive using this approach.

With this approach I have run into two issues.

1. My input record count is 20k and the output record count is approx. 23.5k. Some JSON records are breaking and creating duplicates.

2. My input is 10k records and the output record count is 20k. As per the link, a record that is already present in the table should be updated, not inserted again.

Can anyone guide me on how to do upserts in Hive?

Apart from the method mentioned above, here are a few other methods I tried for upserts that failed:

I have tried the MERGE option in Hive (refer link). MERGE has not been suitable for merging 5 GB of data or more: it takes hours to complete, does not complete at all, or fails with a heap memory error, even for 6 GB of data.

Someone suggested MERGE with both the source and the destination partitioned. We get an error if the destination is partitioned, since MERGE cannot update the partition key value.

Cluster RAM size is 250 GB.

Can anyone please help me with this, with definitive steps? It should work for upserts (when a record matches, update it; if not, insert it) on larger datasets of more than 5 TB. None of the solutions I have found on the Internet have worked in over a month of trying. Could anyone let me know valid steps for larger datasets with JSON?
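
To make clear what I mean by an upsert, here is a minimal sketch in plain Python of the behavior I expect. It is not Hive code and says nothing about how to do this at 5 TB scale. The field names `id` and `updated_at`, the function `upsert`, and the sample records are all hypothetical, made up for illustration only.

```python
import json


def upsert(target, incoming_lines, key="id", version="updated_at"):
    """Apply JSON lines to `target` (a dict keyed by `key`) and return a new dict.

    `key` and `version` are assumed field names, not taken from my real data.
    """
    result = dict(target)
    for line in incoming_lines:
        record = json.loads(line)
        k = record[key]
        current = result.get(k)
        # Insert when the key is new; update only when the incoming
        # record is at least as recent as the one already stored.
        if current is None or record[version] >= current[version]:
            result[k] = record
    return result


# Existing table contents: two rows.
target = {
    1: {"id": 1, "name": "a", "updated_at": 1},
    2: {"id": 2, "name": "b", "updated_at": 1},
}

# Incoming batch: one update, one new row, and one stale duplicate of id 2.
incoming = [
    '{"id": 2, "name": "b2", "updated_at": 5}',
    '{"id": 3, "name": "c", "updated_at": 2}',
    '{"id": 2, "name": "b-old", "updated_at": 4}',
]

merged = upsert(target, incoming)
print(len(merged), merged[2]["name"])
```

With these sample records I would expect three rows in the result, not five: the row for `id` 2 is replaced once and the stale duplicate is ignored. In my Hive output the row count grows instead, as described in the two issues above.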