Hello,

I've compiled this article after struggling quite a bit to get the ingestion working.

Credits to @Kuldeep Kulkarni, @Rahul Pathak and @awhitter for helping me figure out these steps.

Feel free to contact me. I'll make sure to keep this up-to-date.

Sqoop is the tool of choice for ingesting OLTP data into HDFS or Hive. Incremental ingestion requires keeping track of the last row ingested in order to know where the next round should start from. Using a shared Sqoop metastore and a Sqoop job lets Sqoop take care of tracking the ingestion.

Oozie will be set up to call the Sqoop job at the required time interval. For this, a coordinator will be required.

1 Sqoop

1a. Shared Metastore

Create a Sqoop metastore on a node (preferably not on a data/worker node).

- Create a folder to hold the metastore DB file (e.g. "/var/lib/sqoop/metastore")

- Configure the metastore from Ambari > Sqoop > Configs:

sqoop.metastore.server.location "/var/lib/sqoop/metastore/shared.db"
sqoop.metastore.client.record.password "true"
sqoop.metastore.server.port "17000"
sqoop.metastore.client.autoconnect.url "jdbc:hsqldb:hsql://<Metastore_node>:17000/sqoop"
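
For reference, if you manage the configuration outside Ambari, these settings end up in sqoop-site.xml roughly as follows (a sketch using the example values above; the property names are the standard Sqoop metastore keys):

	<property>
		<name>sqoop.metastore.server.location</name>
		<value>/var/lib/sqoop/metastore/shared.db</value>
	</property>
	<property>
		<name>sqoop.metastore.server.port</name>
		<value>17000</value>
	</property>
	<property>
		<name>sqoop.metastore.client.record.password</name>
		<value>true</value>
	</property>
	<property>
		<name>sqoop.metastore.client.autoconnect.url</name>
		<value>jdbc:hsqldb:hsql://<Metastore_node>:17000/sqoop</value>
	</property>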

Start the Sqoop-Metastore process

"nohup sqoop-metastore &>>/var/log/sqoop-metastore.log &"

Sqoop unfortunately does not provide an OS service (yet), so ensure that the metastore startup command is added to your startup scripts, as sketched below.
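
A minimal sketch of one way to do that, assuming a host where /etc/rc.local is executed at boot and a local sqoop user exists (both are assumptions, adjust to your environment):

# Append the metastore startup command to the boot script (run once, as root)
echo "su - sqoop -c 'nohup sqoop-metastore &>>/var/log/sqoop-metastore.log &'" >> /etc/rc.local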

1b. Sqoop job

Notes:

Active Directory authentication is not supported yet. The user must be a SQL user.

As Sqoop will store the credentials in clear text, the user must be created with the least privileges possible and monitored accordingly.

While testing with AD integration (";Integrated Security=SSPI" in the connection string), the connection was successful; however, this is not (yet) a HWX-supported configuration.

sqoop-job --create <JobName> \
--meta-connect jdbc:hsqldb:hsql://<metastore_node>:17000/sqoop \
-- import \
--connect "jdbc:sqlserver://<SQLSERVER>:1433;database=<SQLDB>;" \
--username <username> \
--password <password> \
--num-mappers 5 \
--table <SQL TableName> \
--append \
--incremental append \
--check-column <Base Column for incremental> \
--hive-drop-import-delims \
--hive-import \
--hive-table <Hive DB>.<Hive Table> \
-- -- --schema <MSSQL Schema> \
-- -- --table-hints NOLOCK

The job will now appear in the job list:

sqoop-job --list

and can be run manually:

sqoop-job --exec <JobName>
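
To check what the metastore has recorded for the job (for example the incremental.last.value that drives the next run), the standard --show option can be used; the placeholders are the same as above:

sqoop-job --show <JobName> \
--meta-connect jdbc:hsqldb:hsql://<metastore_node>:17000/sqoop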

2 Oozie

2a <application>-site.xml files

If not already available in HDFS, copy the sqoop-site.xml file to HDFS:

/usr/hdp/current/sqoop-client/conf/sqoop-site.xml 
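
For example, a sketch of the copy (assuming <HDFS_PATH> is the same placeholder location referenced later in the workflow's <job-xml> entries):

hdfs dfs -put /usr/hdp/current/sqoop-client/conf/sqoop-site.xml /<HDFS_PATH>/sqoop-site.xml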

Copy hive-site.xml to a temp location for editing

cp /usr/hdp/current/hive-server2/conf/hive-site.xml /tmp

Edit /tmp/hive-site.xml, locate the property

hive.exec.post.hooks

and remove the value, if any is present.

The result in the file:

	<property>
		<name>hive.exec.post.hooks</name>
		<value/>
	</property>

Save the change and copy the modified /tmp/hive-site.xml to HDFS
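
For example (the target matches the hive-site.xml path used in the workflow below):

hdfs dfs -put /tmp/hive-site.xml /<HDFS_PATH>/sqoop/hive-site.xml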

Note #1: if the sqoop-site.xml file is not reachable by Oozie/Sqoop, the Sqoop job will run once and then fail on subsequent runs, logging the following entries:

17/01/17 16:30:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.5.3.0-37 
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: It seems that you have launched a Sqoop metastore job via 
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: Oozie with sqoop.metastore.client.record.password disabled. 
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: But this configuration is not supported because Sqoop can't 
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: prompt the user to enter the password while being executed 
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: as Oozie tasks. Please enable sqoop.metastore.client.record 
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: .password in sqoop-site.xml, or provide the password 
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: explicitly using --password in the command tag of the Oozie 
17/01/17 16:30:33 ERROR sqoop.SqoopOptions: workflow file. 

Note #2: when hive.exec.post.hooks is set, I have seen the workflow execution fail at a rate of 50-60%, after the data is ingested into Hive but before the Sqoop metastore is updated, generating duplicate ingestion in the Hive table.

2b Update the Oozie share lib with the required jar files

/usr/hdp/current/hive-server2/lib/hive-shims-* to hdfs:///user/oozie/share/lib/lib_<timestamp>/hive/
/usr/hdp/current/atlas-server/hook/hive/*.jar to hdfs:///user/oozie/share/lib/lib_<timestamp>/hive/
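
A sketch of the copy commands (run as a user with write access to the Oozie share lib; lib_<timestamp> must match the currently active share lib directory):

hdfs dfs -put /usr/hdp/current/hive-server2/lib/hive-shims-* /user/oozie/share/lib/lib_<timestamp>/hive/
hdfs dfs -put /usr/hdp/current/atlas-server/hook/hive/*.jar /user/oozie/share/lib/lib_<timestamp>/hive/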

Run the following command after obtaining a Kerberos ticket with the Oozie keytab:

oozie admin -oozie https://<oozie_server>:11443/oozie -sharelibupdate 
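
For reference, obtaining the ticket and verifying the updated share lib could look like the following (the keytab path and principal are assumptions, adjust them to your cluster):

kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/<oozie_server>@<REALM>
oozie admin -oozie https://<oozie_server>:11443/oozie -shareliblist hive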

2c Copy required jar files to HDFS

(Microsoft) sqljdbc.jar to <HDFS_PATH>/jars/sqljdbc.jar
/usr/hdp/2.5.3.0-37/sqoop/lib/java-json.jar to <HDFS_PATH>/jars/java-json.jar
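
For example (assuming sqljdbc.jar has already been downloaded to the current directory):

hdfs dfs -put sqljdbc.jar <HDFS_PATH>/jars/sqljdbc.jar
hdfs dfs -put /usr/hdp/2.5.3.0-37/sqoop/lib/java-json.jar <HDFS_PATH>/jars/java-json.jar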

2d Create the workflow

File : workflow.xml

<global> </global> passes the required -site.xml files to all the actions defined.

<archive> </archive> lists the jar files required by Sqoop (it seems that Sqoop will not parse the CLASSPATH).

<workflow-app xmlns="uri:oozie:workflow:0.4" name="Workflow Sqoop Incremental">
    <global>
        <job-xml>/<HDFS_PATH>/sqoop-site.xml</job-xml>
        <job-xml>/<HDFS_PATH>/sqoop/hive-site.xml</job-xml>
    </global>

    <credentials>
        <credential name='hive_creds' type='hcat'>
            <property>
                <name>hcat.metastore.uri</name>
                <value>thrift://<hive thrift URI>:9083</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value>hive/_HOST@domain.name</value>
            </property>
        </credential>
    </credentials>

    <start to="SqoopStep"/>

    <action name="SqoopStep" cred="hive_creds">
        <sqoop xmlns="uri:oozie:sqoop-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>job --meta-connect jdbc:hsqldb:hsql://<metastorenode>:17000/sqoop --exec <SqoopJobname></command>
            <archive><HDFS_PATH>/jars/sqljdbc.jar</archive>
            <archive><HDFS_PATH>/jars/java-json.jar</archive>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>
</workflow-app>

2e Create the coordinator

File : coordinator.xml

<coordinator-app name="Sqoop Incremental Ingestion Scheduler" frequency="${coord:minutes(5)}"
                  start="${start}" end="2100-01-01T08:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
      <controls>
        <timeout>90</timeout>
      </controls>

      <action>
        <workflow>
          <app-path>/<HDFS_PATH>/<FOLDER_WHERE_TO_FIND_WORKFLOW></app-path>
         </workflow>
      </action>
 </coordinator-app>

2f Edit job.properties

File : job.properties

nameNode=hdfs://<namenode>
jobTracker=<jobtracker>:8050
queueName=default
examplesRoot=examples
oozie.action.sharelib.for.sqoop=sqoop,hive,atlas
#oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
#oozie.wf.rerun.failnodes=true
user.name=oozie
oozie.coord.application.path=${nameNode}/HDFSPATH/Folder_To_find_coordinator.xml
outputDir=/HDFS_PATH/Oozie/OutputDir
start=2017-02-15T13:00Z


Key settings:

oozie.action.sharelib.for.sqoop must be set to at least "sqoop,hive"

2g Submitting

Copy both coordinator.xml and workflow.xml to HDFS (make sure the paths in job.properties and coordinator.xml reflect their locations), as sketched below.
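
A sketch of the upload, using the placeholder folders referenced in coordinator.xml and job.properties above (optionally, older Oozie CLI versions can also validate a local workflow file against the bundled schemas before uploading):

oozie validate workflow.xml
hdfs dfs -put workflow.xml /<HDFS_PATH>/<FOLDER_WHERE_TO_FIND_WORKFLOW>/
hdfs dfs -put coordinator.xml /HDFSPATH/Folder_To_find_coordinator.xml/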

Submit the job to Oozie (with a Kerberos ticket generated with the Oozie keytab):

oozie job -oozie https://<oozieserver>:11443/oozie -config <localpath>/job.properties -submit
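
After -submit the coordinator sits in PREP; it can then be started and monitored with the standard Oozie CLI calls below, where <coord-job-id> is the ID printed by the submit command:

oozie job -oozie https://<oozieserver>:11443/oozie -start <coord-job-id>
oozie job -oozie https://<oozieserver>:11443/oozie -info <coord-job-id>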

Hopefully this will prove useful.

Comments

What changes would be needed if I'm importing from a MySQL RDBMS into Hive with incremental import?