Support Questions

Find answers, ask questions, and share your expertise

Updating hive table with sqoop from mysql table

Expert Contributor

Hi everyone, I already have a Hive table called roles. I need to update this table with data coming from MySQL, so I used this script, thinking it would both add new rows and update existing ones in my Hive table:

sqoop import \
  --connect jdbc:mysql://xxxx/retail_export \
  --username xxxx --password xxx \
  --table roles \
  --split-by id_emp \
  --check-column id_emp \
  --last-value 5 \
  --incremental append \
  --target-dir /user/ingenieroandresangel/hive/roles \
  --hive-import --hive-database poc --hive-table roles

Unfortunately, that only inserts new data; it doesn't update records that already exist. Before you ask, a couple of notes:

* the table doesn't have a PK

* if I don't specify --last-value as a parameter, I get duplicate records for rows that already exist.

How can I solve this without truncating the table or recreating it with a PK? Is there a way?

thanks guys.

1 ACCEPTED SOLUTION

Super Collaborator

@Andres Urrego,

What you are looking for (UPSERTs) isn't available in Sqoop import.

There are several approaches on how to actually update data in Hive. One of them is described here:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_data-access/content/incrementally-updati...

Other approaches include side-loading the incoming data and merging it into the target table, either as a post-Sqoop step or as a scheduled job/process.
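A side-load-and-merge job might look roughly like this. This is only a sketch based on the question's details: the staging path, the delta table name `roles_delta`, and the use of `id_emp` as the join key are all assumptions, not a definitive implementation.

```shell
# 1) Sqoop the changed rows into a staging directory, NOT the final Hive table.
#    (connection details and --last-value copied from the question; adjust as needed)
sqoop import \
  --connect jdbc:mysql://xxxx/retail_export \
  --username xxxx --password xxx \
  --table roles \
  --split-by id_emp \
  --check-column id_emp --last-value 5 --incremental append \
  --target-dir /user/ingenieroandresangel/staging/roles_delta

# 2) Expose the delta as an external table, then rebuild the target:
#    take every row from the delta, plus base rows whose id_emp is NOT in the delta.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS poc.roles_delta LIKE poc.roles
LOCATION '/user/ingenieroandresangel/staging/roles_delta';

INSERT OVERWRITE TABLE poc.roles
SELECT * FROM poc.roles_delta
UNION ALL
SELECT b.*
FROM poc.roles b
LEFT JOIN poc.roles_delta d ON b.id_emp = d.id_emp
WHERE d.id_emp IS NULL;
"
```

Note that `INSERT OVERWRITE` rewrites the whole table, so this pattern trades write cost for not needing ACID support; it also assumes `id_emp` uniquely identifies a row, which the question says is not guaranteed without a PK.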

You can also look at Hive ACID transactions, or at the Hive-HBase integration package.

Choosing the right approach is not trivial and depends on: initial volume, incremental volumes, frequency of incremental jobs, probability of updates, ability to identify record uniqueness, acceptable latency, etc.

