Created 08-14-2017 07:54 PM
Hi everyone, I already have a Hive table called roles. I need to update this table with data coming from MySQL, so I used this script thinking it would both add new data and update existing data in my Hive table:
sqoop import --connect jdbc:mysql://xxxx/retail_export --username xxxx --password xxx \
  --table roles --split-by id_emp --check-column id_emp --last-value 5 --incremental append \
  --target-dir /user/ingenieroandresangel/hive/roles --hive-import --hive-database poc --hive-table roles
Unfortunately, that only inserts the new data; I can't update the records that already exist. Before you ask, a couple of points:
* the table doesn't have a PK
* if I don't specify --last-value as a parameter, I get duplicate records for the rows that already exist.
How can I solve this without truncating the table or recreating it with a PK? Is there a way?
Thanks, guys.
Created 08-15-2017 10:54 PM
What you are looking for (an upsert) isn't available in Sqoop import.
There are several approaches to actually updating data in Hive. One of them is described here:
Other approaches use a side load and merge, run as a post-Sqoop step or as a scheduled job/process.
You can also look at Hive ACID transactions, or at the Hive-HBase integration package.
Choosing the right approach is not trivial and depends on: initial volume, incremental volume, frequency of incremental jobs, probability of updates, ability to identify record uniqueness, acceptable latency, etc.
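To illustrate the side-load-and-merge idea, here is a minimal HiveQL sketch. Everything beyond the thread is an assumption: it supposes the incremental rows were sqooped into a hypothetical staging table poc.roles_stage, that id_emp is unique enough to act as the merge key, and col_a/col_b stand in for the real column list of roles.

```sql
-- Hedged sketch of "side load and merge": rebuild the target table,
-- preferring the staging copy of a row whenever one exists.
-- Assumptions: poc.roles_stage (staging table name), id_emp as the key,
-- and col_a/col_b as placeholder columns are all hypothetical.
INSERT OVERWRITE TABLE poc.roles
SELECT id_emp, col_a, col_b
FROM (
    SELECT u.*,
           ROW_NUMBER() OVER (PARTITION BY id_emp ORDER BY src DESC) AS rn
    FROM (
        SELECT id_emp, col_a, col_b, 1 AS src FROM poc.roles_stage  -- new/updated rows win
        UNION ALL
        SELECT id_emp, col_a, col_b, 0 AS src FROM poc.roles        -- existing rows
    ) u
) d
WHERE rn = 1;
```

After verifying the result, clear the staging table so the next incremental Sqoop run starts clean; the whole sequence (import to staging, merge, clear) is what you would schedule as the post-Sqoop job.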