04-17-2017 12:14 AM - edited 04-17-2017 05:25 AM
Im working with Sqoop incremental import by taking the data everyday into my hive table. I have the following scenario: I have an RDBMS table: 'empdata' with columns
id name city 1 Sid Amsterdam 2 Bob Delhi 3 Sun Dubai 4 Rob London
I am importing the data into Hive, using Sqoop incremental import through a cron job which shell script to do the work.
#!/bin/bash DATE=$(date +"%d-%m-%y") while IFS=":" read -r server dbname tablename; do sqoop import --connect jdbc:mysql://$server/$dbname --table $tablename --username root --password cloudera --hive-import --hive-table dynpart --hive-partition-key 'thisday' --hive-partition-value $DATE --target-dir '/user/hive/newimp5' --incremental-append --check-column id --last-value $(hive -e "select max(id) from $tablename"); done</home/cloudera/Desktop/MyScripts/tables.txt
The above script for incremental load is working fine. But now I have another requirement, which is to check if there are any updates to previous records. Like if the record:
1 Rob London is updated to 1 Rob NewYork
I need to take that updated record(s) along with the incremental import but only the updated value should be present in Hive table so that I don't have duplicate values either. Could anyone tell me how can I accomplish it ?