
Sqoop job for incremental import execution from Oozie

Expert Contributor

I hear that the Sqoop metastore can take care of incremental imports, so that I do not need to keep track of the last updated id/datetime myself.

I am trying to execute this from an Oozie workflow, but my questions are:

1. What goes into the last-value parameter of the Sqoop command in that case (when I have a Sqoop job and metastore configured)? Do I even need to pass the parameter?

2. Also, can I give multiple import statements in a single Sqoop job?

3. If yes, how?

4. Is it a good idea to execute multiple table imports in parallel? (I would really like to know the pros and cons involved.)

5. If I plan to have table imports in parallel, do I just fork and execute jobs in oozie?

1 ACCEPTED SOLUTION

Master Guru

Before answering the questions themselves:

I prefer to do it a different way: if you run incremental loads, you have a hard time rolling back imports that may or may not have failed. It is easier to associate each import with a partition in Hive and then just delete that partition in case of failure.

I.e., if you want to load data hourly, create an hourly partitioned table, run an hourly Oozie coordinator job, and use coord:formatTime to provide min/max parameters for that hour. This way you can simply re-run the Oozie instance in case of any failure and everything will be consistent. If you do incremental loads in the middle of a time period, you don't have much control over the data entering your tables, and if you rerun a job you get duplicate data.
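A minimal sketch of that coordinator pattern (the dates, paths, and the minTime/maxTime parameter names are placeholders I made up for illustration, not something Oozie mandates):

<coordinator-app name="hourly-import" frequency="${coord:hours(1)}"
                 start="2016-01-01T00:00Z" end="2017-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${nameNode}/user/${user.name}/apps/hourly-import</app-path>
      <configuration>
        <!-- lower/upper bound of the hour this instance is responsible for -->
        <property>
          <name>minTime</name>
          <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd HH:00:00')}</value>
        </property>
        <property>
          <name>maxTime</name>
          <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), 1, 'HOUR'), 'yyyy-MM-dd HH:00:00')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

The Sqoop action inside the workflow can then use ${minTime} and ${maxTime} in the WHERE clause of a free-form query and write into the matching Hive partition, so re-running a failed coordinator instance just rewrites that one partition.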

Apart from that:

1) If you want a central metastore for Sqoop jobs that run in Oozie, I think you need to set up the metastore and then point the jobs at it with the --meta-connect parameter.

This JIRA is helpful:

https://issues.apache.org/jira/browse/SQOOP-453
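Roughly like this (the metastore host, password file, and job name are placeholders; the metastore itself is started with "sqoop metastore" and listens on port 16000 by default):

# Create the saved job once against the shared metastore
# (metastore.example.com, the password file, and the job name emp_incr are placeholders)
sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
  --create emp_incr \
  -- import \
  --connect "jdbc:sqlserver://dbhost:1433;database=US_DB" \
  --username sa --password-file /user/root/.sqoop-password \
  --table employee \
  --incremental append --check-column empid --last-value 0

# Execute it (from Oozie, the <command> is the same line without the leading "sqoop")
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop --exec emp_incr

That also answers question 1: with a saved job you pass --last-value only once, as a starting value at creation time. After every successful run Sqoop writes the new maximum of the check column back into the metastore, so the executions themselves never need the parameter.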

2) You can use import-all-tables to import multiple tables in one command.
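For example (the connection details are placeholders again); note that import-all-tables applies the same settings to every table and, as far as I know, does not support per-table incremental options:

sqoop import-all-tables \
  --connect "jdbc:sqlserver://dbhost:1433;database=US_DB" \
  --username sa --password-file /user/root/.sqoop-password \
  --warehouse-dir /user/root/us_db \
  --exclude-tables audit_log,tmp_staging \
  -m 4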

4) It depends. If it's the same source database, I would say no, since the bottleneck will most likely be the network or the database returning the data. In that case it's better to increase the number of mappers and run the imports one by one. For small tables, or ones you cannot partition, loading them in parallel might be good. However, you will have a bit of overhead in the cluster, since each parallel Oozie job holds three mostly idle containers (the Oozie launcher AM, the Oozie launcher map task, and the Sqoop AM); this can add up on small clusters as well.
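So instead of many parallel single-table imports hitting the same database, something like this (the mapper count and split column are just an illustration) usually makes better use of the connection:

# one import at a time, but with more parallel mappers
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;database=US_DB" \
  --username sa --password-file /user/root/.sqoop-password \
  --table employee \
  --split-by empid \
  -m 8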

5) Yes, just fork the workflow into one Sqoop action per table and join them again afterwards.
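A minimal fork/join sketch of that (the action names, table names, and the ${jdbcUrl}/${dbUser}/${passwordFile} parameters are placeholders; each branch is an ordinary Sqoop action):

<workflow-app name="parallel-imports" xmlns="uri:oozie:workflow:0.4">
  <start to="fork-imports"/>
  <fork name="fork-imports">
    <path start="import-employee"/>
    <path start="import-department"/>
  </fork>
  <action name="import-employee">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect ${jdbcUrl} --username ${dbUser} --password-file ${passwordFile} --table employee --target-dir /user/root/import/employee</command>
    </sqoop>
    <ok to="join-imports"/>
    <error to="fail"/>
  </action>
  <action name="import-department">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect ${jdbcUrl} --username ${dbUser} --password-file ${passwordFile} --table department --target-dir /user/root/import/department</command>
    </sqoop>
    <ok to="join-imports"/>
    <error to="fail"/>
  </action>
  <join name="join-imports" to="end"/>
  <kill name="fail">
    <message>Import failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>

Keep the container overhead mentioned under 4) in mind when deciding how many branches to fork.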


5 REPLIES


Rising Star

Hi @Benjamin Leonhardi

I am trying to import a table from SQL Server into HBase using Sqoop, keep the HBase table up to date through Sqoop incremental imports on empid, and schedule the Sqoop job with an Oozie workflow so that it runs on a regular time basis.

Example:

SQL TABLE

create table employee(empid int primary key,empname varchar(35),designation varchar(30),salary int);

insert into employee values(300,'Azhar','MD',50000);
insert into employee values(301,'vijay','GM',40000);
insert into employee values(302,'rahul','Asst GM',35000);
insert into employee values(303,'khanna','accountant',25000);
insert into employee values(304,'vikram','sales manager',20000);

IMPORTING DATA INTO HBASE USING SQOOP

sqoop import --connect "jdbc:sqlserver://localhost:1433;database=US_DB" --username sa --password 12345 --table employee --hbase-table hb_emp --column-family empid --hbase-create-table

INCREMENTAL IMPORT IN SQOOP FOR SQL-HBASE TABLE

sqoop import --connect "jdbc:mysql://localhost;database=US_DB" --username root -P --table employee --hbase-table hb_emp --column-family cfemp --incremental append --check-column empid --last-value 304

SCHEDULING SQOOP INCREMENTAL JOB USING OOZIE FOR HBASE TABLE

Here is my job.properties and workflow.xml configuration

job.properties

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8050
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.libpath=${nameNode}/user/root/share/lib
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/sqoop/

workflow.xml

<workflow-app name="sqoop-hbase-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/user/root/hbase-site.xml</job-xml>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>sqoop import --connect "jdbc:mysql://localhost;database=US_DB" --username root -P --table employee --hbase-table hb_emp --column-family cfemp --incremental append --check-column empid --last-value 304</command>
            <file>/user/root/sqljdbc4.jar#sqljdbc4.jar</file>
            <file>/user/root/hbase/hbase-client-1.1.2.2.4.0.0-169.jar#hbase-client-1.1.2.2.4.0.0-169.jar</file>
            <file>/user/root/hbase/hbase-common-1.1.2.2.4.0.0-169.jar#hbase-common-1.1.2.2.4.0.0-169.jar</file>
            <file>/user/root/hbase/hbase-protocol-1.1.2.2.4.0.0-169.jar#hbase-protocol-1.1.2.2.4.0.0-169.jar</file>
            <file>/user/root/hbase/htrace-core-3.1.0-incubating.jar#htrace-core-3.1.0-incubating.jar</file>
            <file>/user/root/hbase/hbase-server-1.1.2.2.4.0.0-169.jar#hbase-server-1.1.2.2.4.0.0-169.jar</file>
            <file>/user/root/hbase/hbase-hadoop-compat-1.1.2.2.4.0.0-169.jar#hbase-hadoop-compat-1.1.2.2.4.0.0-169.jar</file>
            <file>/user/root/hbase/high-scale-lib-1.1.1.jar#high-scale-lib-1.1.1.jar</file>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

ERROR

I am getting stuck while running the Oozie workflow; it fails with a Sqoop main exception. Please help me solve this issue.

What are the compatible versions for completing this task using HDP 2.4?

ENVIRONMENT

Hortonworks HDP 2.4

Thanks

New Contributor

I am really late with this answer!

But since I recently faced this issue myself, I am going to answer to help somebody else out. The solution was to remove the sqoop keyword when passing the command in the <command> tag of workflow.xml.

Pass the command this way:

<command>import --connect "jdbc:mysql://localhost;database=US_DB" --username root -P --table employee --hbase-table hb_emp --column-family cfemp --incremental append --check-column empid --last-value 304</command>

Explorer

@simran kaur Can you tell us how you ran multiple table imports using Oozie/Sqoop, apart from using import-all-tables?