
Updating hive table with sqoop from mysql table

Expert Contributor

Hi everyone, I already have a Hive table called `roles`. I need to update this table with data coming from MySQL, so I used the following script, thinking it would both insert new data and update existing data in my Hive table:

```shell
sqoop import --connect jdbc:mysql://xxxx/retail_export --username xxxx --password xxx \
  --table roles --split-by id_emp --check-column id_emp --last-value 5 --incremental append \
  --target-dir /user/ingenieroandresangel/hive/roles \
  --hive-import --hive-database poc --hive-table roles
```

Unfortunately, that only inserts the new data; it doesn't update records that already exist. Before you ask, a couple of notes:

* The table doesn't have a primary key (PK).

* If I don't specify `--last-value` as a parameter, I get duplicate records for the rows that already exist.

How can I solve this without truncating the table or recreating it with a PK? Is there a way?

Thanks, guys.

1 ACCEPTED SOLUTION

Super Collaborator

@Andres Urrego,

What you are looking for (upserts) isn't available in Sqoop's Hive import.
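One narrow exception worth knowing about: for plain HDFS targets (i.e. without `--hive-import`), Sqoop's `lastmodified` incremental mode can fold updates into the existing files via `--merge-key`. A sketch, assuming the MySQL table had a modification-timestamp column such as `updated_at` (which `roles` may not have, so this is illustrative only):

```shell
# Sketch only: requires a timestamp column (here assumed to be "updated_at")
# and a column usable as a merge key. Works for HDFS directories,
# not in combination with --hive-import.
sqoop import --connect jdbc:mysql://xxxx/retail_export --username xxxx --password xxx \
  --table roles --split-by id_emp \
  --incremental lastmodified --check-column updated_at --last-value "2018-01-01 00:00:00" \
  --merge-key id_emp \
  --target-dir /user/ingenieroandresangel/hive/roles
```

Since your table sits behind a Hive-managed import and lacks a clean key, this probably won't fit as-is, which is why the merge-based approaches below are usually recommended instead.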

There are several approaches on how to actually update data in Hive. One of them is described here:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_data-access/content/incrementally-updati...

Another approach is to side-load the incoming data into a staging area and merge it into the target table, either as a post-Sqoop step or as a scheduled job/process.
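The side-load-and-merge pattern can be sketched roughly like this. All names here are illustrative: it assumes a staging table `poc.roles_stage` loaded by a plain Sqoop import, and that `id_emp` can serve as a match key (your table has no PK, so you would need to pick or construct something sufficiently unique):

```sql
-- Sketch only: roles_stage is assumed to hold the latest Sqoop extract,
-- and id_emp is assumed to identify a record well enough to match on.
-- Rebuild the target as (unchanged old rows) + (all staged rows):
INSERT OVERWRITE TABLE poc.roles
SELECT t.*
FROM poc.roles t
LEFT JOIN poc.roles_stage s ON t.id_emp = s.id_emp
WHERE s.id_emp IS NULL      -- keep only rows NOT present in the new extract
UNION ALL
SELECT * FROM poc.roles_stage;  -- new and updated rows win
```

The `INSERT OVERWRITE` rewrites the whole table, which is the usual trade-off of this pattern: simple and PK-free, but full-table I/O on every run.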

You can also look at Hive ACID transactions, or at the Hive-HBase integration package.
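With Hive ACID (Hive 0.14+), a transactional table supports `UPDATE` directly. A minimal sketch, assuming transactions are enabled on the cluster and using illustrative column names; ACID tables must be bucketed and stored as ORC:

```sql
-- Sketch only: requires hive.support.concurrency, the DbTxnManager,
-- and the compactor to be enabled cluster-wide.
CREATE TABLE poc.roles_acid (
  id_emp INT,
  role_name STRING
)
CLUSTERED BY (id_emp) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level update, something a plain Hive table cannot do:
UPDATE poc.roles_acid SET role_name = 'analyst' WHERE id_emp = 5;
```

Note that this changes the table's storage format and operational requirements, so it is a bigger commitment than the merge pattern.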

Choosing the right approach is not trivial and depends on: initial volume, incremental volumes, frequency of incremental jobs, probability of updates, the ability to identify record uniqueness, acceptable latency, etc.
