Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

insert or update if exists in hive

avatar
Expert Contributor

I have a table `old_data` and a table `new_data`. I want to merge these table using so that `old_data` gets updated with `new_data`

1. Rows in old_data stay there

2. New rows in new_data get added to old_data

3. unique key is `id` so rows with `id` in `new_data` should update existing ones in `old_data`

I think it is possible using join so I can do something like

INSERT OVERWRITE old_data
SELECT..Example: 
Table a:
id          count 
1             2
2            19 
3              4 


Table b:

    id             count
    2              22
    5              7 

I need a SELECT statement that gives me

   id            count 
    1              2 
    2              22
    3              4
    5              7
4 REPLIES 4

avatar
Guru

If you do not have to worry about partitions, it is as you state:

INSERT OVERWRITE old_data SELECT <statement> FROM new_data;

If you have a partition you must specify it as

INSERT OVERWRITE old_data PARTITION (id = <value>) SELECT <statement filtering by id> FROM new_data;

Note for the SELECT statement you have to select the same columns and column order as those you are inserting into.

See the following for more color:

https://community.hortonworks.com/questions/28683/overwriting-a-column-in-hive.html

https://community.hortonworks.com/questions/5579/insert-overwrite-of-2-gb-data.html

https://community.hortonworks.com/questions/49967/insert-overwrite-running-too-slow-when-inserting-d...

avatar
Expert Contributor

PLease check the update. I think I didn't explain it properly

avatar
Expert Contributor

PLease check the update. I think I didn't explain it properly

avatar
Super Collaborator

Assuming 'id' is never null in either table

select

case when s.id is null then t.id else s.id end,

case when s.id is null then t.count else s.count end

from old_data t full outer join new_data s on t.id = s.id

though what you really want is the MERGE statement which is WIP (https://issues.apache.org/jira/browse/HIVE-10924)