Sqoop is the tool of choice for ingesting OLTP data into HDFS or Hive. Incremental ingestion requires keeping track of the last row ingested in order to know where to start the next round from. Using a shared Sqoop metastore together with a saved Sqoop job lets Sqoop take care of tracking the ingestion state.
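As a sketch, a saved incremental job stored in a shared metastore can be created and executed as follows (the metastore host, JDBC URL, table, and check column below are placeholders, not values from this cluster):

```shell
# Create a saved Sqoop job in the shared metastore
# (hostnames, database, table and column names are illustrative).
sqoop job \
  --meta-connect "jdbc:hsqldb:hsql://metastore-host:16000/sqoop" \
  --create orders_incremental \
  -- import \
  --connect "jdbc:mysql://oltp-host:3306/sales" \
  --username ingest \
  --table ORDERS \
  --incremental append \
  --check-column ORDER_ID \
  --last-value 0 \
  --hive-import \
  --hive-table sales.orders

# Each execution reads the stored last-value, imports only the new rows,
# and writes the updated last-value back to the metastore.
sqoop job \
  --meta-connect "jdbc:hsqldb:hsql://metastore-host:16000/sqoop" \
  --exec orders_incremental
```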
Oozie will be set up to call the Sqoop job at the required time interval. For this, a coordinator is required.
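A minimal coordinator sketch, assuming an hourly schedule and a workflow application path that are placeholders for your own values:

```xml
<coordinator-app name="sqoop-incremental-coord"
                 frequency="${coord:hours(1)}"
                 start="2017-01-01T00:00Z" end="2099-01-01T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- Workflow containing the Sqoop action that runs "job --exec" -->
      <app-path>${nameNode}/user/${user.name}/apps/sqoop-incremental</app-path>
    </workflow>
  </action>
</coordinator-app>
```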
1a. Shared Metastore
Create a Sqoop metastore on a node (preferably not on a data/worker node).
- Create a folder to hold the metastore DB file (e.g. /var/lib/sqoop/metastore)
- Configure the metastore from Ambari > Sqoop > Configs
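The metastore-related properties to set typically include the following (the values shown are illustrative assumptions, not cluster-specific settings):

```xml
<!-- Illustrative sqoop-site.xml values; adjust host, port and path -->
<property>
  <name>sqoop.metastore.server.location</name>
  <value>/var/lib/sqoop/metastore/shared.db</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:hsql://metastore-host:16000/sqoop</value>
</property>
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>
```

Note that sqoop.metastore.client.record.password must be enabled, as the error in Note #1 below shows.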
Save the change and copy the modified sqoop-site.xml to HDFS.
Note #1: if the sqoop-site.xml file is not reachable by Oozie/Sqoop, the Sqoop job will run once and then fail on subsequent runs, logging the following entries:
17/01/17 16:30:32 INFO sqoop.Sqoop: Running Sqoop version: 18.104.22.168.5.3.0-37
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: It seems that you have launched a Sqoop metastore job via
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: Oozie with sqoop.metastore.client.record.password disabled.
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: But this configuration is not supported because Sqoop can't
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: prompt the user to enter the password while being executed
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: as Oozie tasks. Please enable sqoop.metastore.client.record
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: .password in sqoop-site.xml, or provide the password
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: explicitly using --password in the command tag of the Oozie
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: workflow file.
Note #2: when hive.exec.post.hooks is set, I have seen the workflow execution fail at a rate of 50-60%, after the data is ingested into Hive but before the Sqoop metastore is updated, generating duplicate rows in the Hive table.
2b. Update the Oozie share lib with the required jar files
/usr/hdp/current/hive-server2/lib/hive-shims-* to hdfs:///user/oozie/share/lib/lib_<timestamp>/hive/
/usr/hdp/current/atlas-server/hook/hive/*.jar to hdfs:///user/oozie/share/lib/lib_<timestamp>/hive/
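The copy can be done with hdfs dfs; as a sketch (lib_<timestamp> stands for the actual sharelib directory on your cluster, and the Oozie URL is an assumption):

```shell
# Find the current sharelib directory first, e.g.:
#   oozie admin -oozie http://oozie-host:11000/oozie -shareliblist hive
# Then copy the jars into it (lib_<timestamp> is a placeholder).
hdfs dfs -put /usr/hdp/current/hive-server2/lib/hive-shims-* \
  /user/oozie/share/lib/lib_<timestamp>/hive/
hdfs dfs -put /usr/hdp/current/atlas-server/hook/hive/*.jar \
  /user/oozie/share/lib/lib_<timestamp>/hive/
```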
Run the following command after creating a Kerberos ticket with the Oozie keytab:
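The command referred to is presumably the Oozie sharelib refresh, which makes the server pick up the new jars without a restart. A sketch, where the keytab path, principal, and Oozie URL are assumptions:

```shell
# Obtain a Kerberos ticket with the Oozie service keytab (principal is an example)
kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/$(hostname -f)
# Tell the Oozie server to reload the updated sharelib
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate
```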