Support Questions

Find answers, ask questions, and share your expertise

how to schedule sqoop import on daily bases

avatar
Explorer

hi,

wanted to move rdbms data into hdfs on daily bases using sqoop import.

is there any job shedular avaible in hadoop system which will run sqoop import command on daily bases ?

if  yes then please provide details such as how to set job sheduleing frequency and how to check log if job fails ?

 

regards

vivek

15 REPLIES 15

avatar
Expert Contributor

Hi,

 

There are 2 options that you can do.

 

 

1. You can create a shell script(Containing the Sqoop commands) and set a Cron based scheduler time to time as per the your requirement.

 

Link:- https://www.taniarascia.com/setting-up-a-basic-cron-job-in-linux/

 

2. Or you can create Oozie workflow jobs( which will run Sqoop actions frequently) and pass it out in coordinator.

 

Link:- https://oozie.apache.org/docs/4.1.0/DG_SqoopActionExtension.html (To create workflow.xml)

Link:- https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html ( To create coordinator)

 

Regards

Nitish

 

 

avatar
Explorer

hi,

i want to import rdbms table into hdfs usign sqoop and increment will happen on daily bases.

how sqoop will read max incremental value from previous upload ?

 

regards

vivek

avatar
Expert Contributor

Hi,

 

You can create Sqoop jobs which will store the value for last value into the embedded database.

 

Link:- https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_saved_jobs

 

Kindly go through above link which will help you to achieve this.

 

Regards

Nitish

avatar
Explorer

 

"If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows."

 

but  how to configure saved job

is it posssible to view this metastore or last imported values

 

 

avatar
Expert Contributor

Hi,

 

When ever you create sqoop job then it gets created as local metastore under $home/.sqoop folder.

 

I woul request you to confgiure Global metastore so that all the jobs can be accessed from other nodes as well.

 

Use the [Best Practices for Sqoop1 Metastore] instructions below to manage the Sqoop1 Metastore

 

[Best Practices for Sqoop1 Metastore]

1. Identify a cluster host (e.g. metastore1.cloudera.com) upon which to run the Metastore process

2. Use the [Deploy Sqoop1 Client Gateway] instructions to deploy it on the Metatore host

3. Use the [Configure Sqoop1 Metastore Directory] instructions to create a directory for the Metastore database

4. Use the [Set Sqoop1 Safety Valve sqoop-site.xml ( metastore )] instructions to configure the Metastore

5. Use the [Start Sqoop1 Metastore Java Process] instructions to start the Metastore process

6. Use the [Stop Sqoop1 Metastore Java Process] instructions to gracefully stop the Metastore process

 

 

~~~~~

 

[Deploy Sqoop1 Client Gateway]

1. Login to Cloudera Manager

2. Navigate to: (Home > Cluster)

3. Click button to the right of ClusterName

4. Select "Add a Service"

5. Select "Sqoop 1 Client"

6. Deploy the Gateway on each host that will run Sqoop1 CLI commands

7. Complete wizard

 

Also see Cloudera Documentation (Deploy Sqoop Client Gateway):

http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_mc_sqoop1_client...

 

~~~~~

 

[Configure Sqoop1 Metastore Directory]

1. Execute [commands] to setup the Sqoop1 Metastore directory

 

[commands]

ssh root@metastore.cloudera.com

mkdir /var/lib/sqoop-metastore (can be anywhere)

chown sqoop:sqoop /var/lib/sqoop-metastore

 

~~~~~

 

[Set Sqoop1 Safety Valve sqoop-site.xml ( metastore )]

1. Login to Cloudera Manager

2. Navigate to: (Home > Cluster > Sqoop1 Client Gateway > Configuration)

3. Search for "Sqoop 1 Client Client Advanced Configuration Snippet (Safety Valve) for sqoop-conf/sqoop-site.xml"

4. Add the [xml properties] below

5. Save Changes

6. Redeploy Sqoop1 Client Gateway

 

[xml properties]

<!-- START SQOOP1 CLIENT GATEWAY SQOOP-SITE.XML SAFETY VALVE CONFIGURATION -->

<property>

<name>sqoop.metastore.client.autoconnect.url</name>

<value>jdbc:hsqldb:hsql://metastore1.cloudera.com:16000/sqoop</value>

<description>THIS TELLS SQOOP1 WHERE TO GO TO CONNECT TO SHARED METASTORE</description>

</property>

<property>

<name>sqoop.metastore.server.location</name>

<value>/var/lib/sqoop-metastore/metastore.db</value>

<description>THIS TELLS SQOOP1 THE LINX PATH AND METASTORE NAME FOR THE DATABASE</description>

</property>

<property>

<name>sqoop.metastore.server.port</name>

<value>16000</value>

<description>THIS THE LISTEN PORT FOR HSQLDB</description>

</property>

<property>

<name>sqoop.metastore.client.record.password</name>

<value>false</value>

<description>IF TRUE, SAVING PASSWORDS IN METASTORE IS ALLOWED</description>

</property>

<property>

<name>sqoop.metastore.client.enable.autoconnect</name>

<value>true</value>

<description>IF TRUE, SQOOP WILL USE LOCAL METASTORE WHEN NO OTHER METASTORE ARGUMENTS ARE PROVIDED </description>

</property>

<!-- SQOOP1 CLIENT GATEWAY SQOOP-SITE.XML SAFETY VALVE CONFIGURATION -->

 

~~~~~

 

[Start Sqoop1 Metastore Java Process]

1. Execute [commands] to start the Sqoop1 Metastore process

 

[commands]

ssh root@metastore.cloudera.com

sudo -u sqoop nohup sqoop-metastore&

ps -aux | egrep -i metastore

 

Output:

root      3021  0.1  0.1 169388  3024 pts/1    S    01:49   0:00 sudo -u sqoop sqoop-metastore

sqoop     3022 11.8  3.9 1535656 75732 pts/1   Sl   01:49   0:02 /usr/java/jdk1.7.0_67-cloudera/bin/java -Xmx1000m -Dhadoop.log.dir=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/GPLEXTRAS-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native:::/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.sqoop.Sqoop metastore

root      3215  0.0  0.0 101012   860 pts/1    S+   01:49   0:00 egrep -i metastore

 

~~~~~

 

[Stop Sqoop1 Metastore Database]

1. Execute [commands] to stop the Sqoop1 Metastore

 

[commands]

ssh root@metastore.cloudera.com

sudo -u sqoop sqoop-metastore --shutdown

ps -aux | egrep -i metastore

 

Hope this helps.

 

 

Regards

Nitish

avatar
Explorer

Really great help.

Thanks lot.

 

regards

vivek

avatar
Expert Contributor

Hi,

 

You are welcome.


Regards

Nitish

avatar
Explorer

any practical example of how to shedule ozzie workflow

avatar
Expert Contributor

Hi,

 

 you can create Oozie workflow jobs( which will run Sqoop actions frequently) and pass it out in coordinator.

 

Link:- https://oozie.apache.org/docs/4.1.0/DG_SqoopActionExtension.html (To create workflow.xml)

Link:- https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html ( To create coordinator)

 

Regards

Nitish