Support Questions

Find answers, ask questions, and share your expertise

How to do Sqoop Incremental Import through Oozie Coordinator

avatar
Explorer

I want to continuously import data from Postgres to Hive through Sqoop Oozie Coordinator Jobs. In incremental import we need to give the last value to start the import from last value.But in this case we don't know the last value as the data is growing continuously ..Is there any way to automate this process?

Could anyone please help me in resolving the issue

 

20 REPLIES 20

avatar
Super Guru
Hi Nekkanti,

You can use Sqoop saved jobs feature, where Sqoop will remember the last incremental import and continue from where it left off, please refer to below doc upstream:

http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#_saved_jobs

>> If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.

Cheers
Eric

avatar
Explorer

Hi Eric,

 

Can we write this Sqoop  saved job creation command in Oozie Workflow.xml file?

 

I set the frequency of Coordinator  to every 30 mins. My requirement is for every 30 Mins the Coordinator job is executed and need to import the data from Postgres to Hive and we don't the know last value of the records previously which it is inserted because we got the data continuously.

If we write the saved job in xml will it execute and import the data from Postgres to Hive?

 

I am running the Coordinator workflow from HUE not through command line.

 

Could you please explain me in detail 

avatar
Explorer

Hi Eric,

 

I tried  to execute the Sqoop saved jobs command through Oozie Coordinator.I executed the Coordinator jobs through HUE .When  I submit the Coordinator job its throwing an error:

 

$SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration

 

This is my workflow.xml :

 

<workflow-app name="sqoop-coordwf" xmlns="uri:oozie:workflow:0.4">
<start to="sqoop-coordaction"/>
<action name="sqoop-coordaction">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="hdfs://HadoopNode3:8020/user/cloudera/student"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<arg>sqoop job</arg>
<arg>--create</arg>
<arg>myjob</arg>
<arg>import</arg>
<arg>--connect</arg>
<arg>jdbc:postgresql://172.19.250.152:5432/inet</arg>
<arg>--username</arg>
<arg>postgres</arg>
<arg>--password</arg>
<arg>postgres</arg>
<arg>--table</arg>
<arg>company</arg>
<arg>--hive-import</arg>
<arg>--hive-table</arg>
<arg>test</arg>
<arg>-m</arg>
<arg>1</arg>
<file>/user/oozieschedules/sqoop/coordinator/hive-site.xml#hive-site.xml</file>

</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Failed, Error Message</message>
</kill>
<end name="end"/>
</workflow-app>

 

 

could you please tell me is there any other alternatives for doing sqoop  incremental import....

avatar
Super Guru
Hi Nekkanti,

The Sqoop action in Oozie, you do not need "sqoop" parent command anymore, you just need sub-command. So change "<arg>sqoop job</arg>" to "<arg>job</arg>".

Also, you should not create job through Oozie, because if you run it again, it might fail, you should create job first and then just execute the job through Oozie so that Sqoop will pick up last value automatically.

If still fails, please share the oozie launcher log for reviewing the error messages.

Cheers

Eric

avatar
Explorer

Hi Eric,

 

When we tried to create sqoop saved jobs through this command:

 

sqoop job --create  myjob -- import --connect jdbc:postgresql://ipaddress:5432/inet --username postgres --P --table company --fields-terminated-by '|' --lines-terminated-by '\n' --hive-import --hive-table sampletest

 

 It throws the following error:

 

19/08/16 15:52:03 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.16.1
19/08/16 15:52:04 ERROR tool.JobTool: I/O error performing job operation: java.io.IOException: Exception creating SQL connection
at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.init(HsqldbJobStorage.java:217)
at org.apache.sqoop.metastore.hsqldb.AutoHsqldbStorage.open(AutoHsqldbStorage.java:112)
at org.apache.sqoop.tool.JobTool.run(JobTool.java:290)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.sql.SQLException: socket creation error
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
at org.hsqldb.jdbc.jdbcConnection.<init>(Unknown Source)
at org.hsqldb.jdbcDriver.getConnection(Unknown Source)
at org.hsqldb.jdbcDriver.connect(Unknown Source)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.init(HsqldbJobStorage.java:177)
... 8 more

 

 Could you please help me in resolving the issue....

 

avatar
Expert Contributor

Hi,


Can you share the output of below command from the host on which you are running the Sqoop job?

 

## ls -ltr /var/lib/sqoop

## ll -aR /home/<username>

 

This will help us to check the permission on these folders.

 

Regards

Nitish

avatar
Explorer

Hi,

 

The output of ls -ltr is:

 

total 600
-rw-r--r-- 1 root root 505 Nov 22 2018 NOTICE.txt
-rw-r--r-- 1 root root 15419 Nov 22 2018 LICENSE.txt
drwxr-xr-x 2 root root 4096 Nov 22 2018 cloudera
lrwxrwxrwx 1 root root 30 Nov 22 2018 sqoop-test.jar -> sqoop-test-1.4.6-cdh5.16.1.jar
lrwxrwxrwx 1 root root 25 Nov 22 2018 sqoop.jar -> sqoop-1.4.6-cdh5.16.1.jar
lrwxrwxrwx 1 root root 15 Nov 22 2018 conf -> /etc/sqoop/conf
lrwxrwxrwx 1 root root 41 Nov 22 2018 sqoop-test-1.4.6-cdh5.16.1.jar -> ../../jars/sqoop-test-1.4.6-cdh5.16.1.jar
lrwxrwxrwx 1 root root 36 Nov 22 2018 sqoop-1.4.6-cdh5.16.1.jar -> ../../jars/sqoop-1.4.6-cdh5.16.1.jar
drwxr-xr-x 2 root root 4096 Nov 22 2018 lib
-rw-r--r-- 1 root root 579785 Jul 19 16:48 postgresql-9.2-1002.jdbc4.jar
drwxr-xr-x 3 root root 4096 Aug 23 17:45 bin

 

 

could you please check and resolve the issue.....

 

 

Thanks,

Akhila.

 

 

 

avatar
Expert Contributor

Hi Akhila,

 

This is the output of what command?

 

Can you please run the below commands from the host and share the output.

 

## ls -ltr /var/lib/sqoop

## ls -ltra /home/<username>

 

Regards

Nitish

avatar
Explorer

Hi,

 

Sqoop is not there in var/lib/sqoop 

Sqoop is installed in /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/sqoop in my environment.

 

so i ran the command:

ls -ltr  /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/sqoop

 

output:

 

total 600
-rw-r--r-- 1 root root 505 Nov 22 2018 NOTICE.txt
-rw-r--r-- 1 root root 15419 Nov 22 2018 LICENSE.txt
drwxr-xr-x 2 root root 4096 Nov 22 2018 cloudera
lrwxrwxrwx 1 root root 30 Nov 22 2018 sqoop-test.jar -> sqoop-test-1.4.6-cdh5.16.1.jar
lrwxrwxrwx 1 root root 25 Nov 22 2018 sqoop.jar -> sqoop-1.4.6-cdh5.16.1.jar
lrwxrwxrwx 1 root root 15 Nov 22 2018 conf -> /etc/sqoop/conf
lrwxrwxrwx 1 root root 41 Nov 22 2018 sqoop-test-1.4.6-cdh5.16.1.jar -> ../../jars/sqoop-test-1.4.6-cdh5.16.1.jar
lrwxrwxrwx 1 root root 36 Nov 22 2018 sqoop-1.4.6-cdh5.16.1.jar -> ../../jars/sqoop-1.4.6-cdh5.16.1.jar
drwxr-xr-x 2 root root 4096 Nov 22 2018 lib
-rw-r--r-- 1 root root 579785 Jul 19 16:48 postgresql-9.2-1002.jdbc4.jar
drwxr-xr-x 3 root root 4096 Aug 23 17:45 bin

 

ls -ltra /home/ubuntu

 

output:

 

total 931732
-rw-r--r-- 1 ubuntu ubuntu 675 Nov 22 2018 .profile
-rw-r--r-- 1 ubuntu ubuntu 3637 Nov 22 2018 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 220 Nov 22 2018 .bash_logout
drwx------ 2 ubuntu ubuntu 4096 Nov 22 2018 .cache
-rw------- 1 ubuntu ubuntu 28 Nov 22 2018 .bash_history
drwxr-xr-x 3 root root 4096 Dec 4 2018 elasticsearch-hadoop-6.3.2
drwxr-xr-x 5 root root 4096 Feb 18 2019 venv
-rw-r--r-- 1 root root 470003663 May 10 14:35 signalstrength_20190510.backup
drwxr-xr-x 3 root root 4096 Aug 7 14:37 ..
-rw-r--r-- 1 root root 470994315 Aug 20 15:45 inet.sql
-rw-r--r-- 1 root root 13040148 Aug 21 17:42 npa.sql
drwxr-xr-x 2 root root 4096 Aug 23 12:31 python
-rw-r--r-- 1 root root 323 Aug 23 17:46 script.sh
drwxr-xr-x 6 ubuntu ubuntu 4096 Aug 23 17:46 .

 

 

Regards,

Akhila.