New Contributor
Posts: 2
Registered: 08-17-2016

Shell script to run Sqoop jobs in parallel inside Oozie

Hello all,

 

I am trying to run a shell script that kicks off Sqoop jobs in parallel. The idea is that each Sqoop job collects data from one schema; all the jobs run in parallel and load their data into a Hive table.

 

The shell script works fine when run from the command line, but stops abruptly when invoked from Oozie.

The logs show multiple Sqoop jobs being kicked off, but after that it just stops.

The shell script and logs are below.

 

Please advise.

 

 

#!/bin/bash
filename="$1"

# Parse a comma-separated "vpdKey,schemaName" input line
parse() {
    IFS=',' read -r -a array <<< "$1"
    vpdKey="${array[0]}"
    schemaName="${array[1]}"
    echo "$vpdKey"
}

while read -r line
do
    parse "$line"
    vpd="$vpdKey"
    schema="$schemaName"
    # Invoke the sqoop job in the background
    sqoop import --connect jdbc:oracle:thin:@//connection string --username blah --password blah --table "$schema".mytable -m 1 --where "vpd_key='$vpd'" --compression-codec=snappy --as-parquetfile --warehouse-dir=/user/hive/blah --hive-import --hive-table newhivetable &

    echo "------------------------------------------------------------------------------------------------"
done < "$filename"
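For reference, because each sqoop import is launched with a trailing "&", the loop finishes and the script itself exits while the imports may still be running in the background. Below is a minimal, runnable sketch of backgrounding jobs and waiting for all of them before exiting (plain subshells stand in for the sqoop commands; whether the missing wait explains the Oozie behaviour is an assumption, not something confirmed in this thread):

```shell
#!/bin/bash
# Sketch: launch several jobs in the background, record their PIDs,
# and block until all of them finish before the script exits.
# The subshell below stands in for the real "sqoop import ... &".
pids=()
for vpd in vpd1 vpd2 vpd3; do
    ( sleep 0.1; echo "import finished for $vpd" ) &
    pids+=("$!")
done

# Without this wait, the script (and any launcher wrapping it) would
# exit while the background jobs are still running.
wait "${pids[@]}"
echo "all imports finished"
```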

 

Log

 

 

Stdoutput Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Stdoutput Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Stdoutput Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Stdoutput Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Stdoutput Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Stdoutput Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Stdoutput 2016-08-17 08:24:04,145 INFO  [main] sqoop.Sqoop (Sqoop.java:(92)) - Running Sqoop version: 1.4.6-cdh5.7.0
Stdoutput 2016-08-17 08:24:04,255 WARN  [main] tool.BaseSqoopTool (BaseSqoopTool.java:applyCredentialsOptions(1023)) - Setting your password on the command-line is insecure. Consider using -P instead.
Stdoutput 2016-08-17 08:24:04,278 INFO  [main] tool.BaseSqoopTool (BaseSqoopTool.java:validateOutputFormatOptions(1355)) - Using Hive-specific delimiters for output. You can override
Stdoutput 2016-08-17 08:24:04,279 INFO  [main] tool.BaseSqoopTool (BaseSqoopTool.java:validateOutputFormatOptions(1356)) - delimiters with --fields-terminated-by, etc.
Stdoutput 2016-08-17 08:24:04,291 INFO  [main] sqoop.Sqoop (Sqoop.java:(92)) - Running Sqoop version: 1.4.6-cdh5.7.0
Exit code of the Shell command 0
<<< Invocation of Shell command completed <<<


<<< Invocation of Main class completed <<<


Oozie Launcher ends

 

Contributor
Posts: 47
Registered: 01-05-2016

Re: Shell script to run Sqoop jobs in parallel inside Oozie

Is there a particular reason you don't build a workflow with several Sqoop actions in parallel? You can do it easily and graphically in Hue, and that's the recommended way of doing it.

 

Apart from that: the logs you posted are too high-level. To get more detailed logs, click the logs icon that appears inside the Shell action box in your workflow a few seconds after you submit the workflow.

 

Clicking it opens a new screen for the Shell action with "standard output" and "standard error" tabs containing more information.

 

Alternatively, you could log in to Cloudera Manager and look for errors in the last 30 minutes, with minimum log level "Warning" and sources Oozie, Hive, HDFS, Sqoop, and Hue.

In that case, use the "Diagnostics" --> "Logs" tool in Cloudera Manager to perform the log search.

 

 

Let us know! :)

New Contributor
Posts: 2
Registered: 08-17-2016

Re: Shell script to run Sqoop jobs in parallel inside Oozie

Thanks for your reply.

 

The reason for using a shell script rather than a static parallel workflow design is to keep it dynamic and handle any new schemas added in the future (in this example I am using VPDs, but in production these will be schemas).

The logs I posted are from the stdout tab. Please advise.

stdout

standard logs above

HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs:
HADOOP_CLIENT_OPTS=:
PREVLEVEL=N:
CONTAINER_ID=container_1471289760180_0031_01_000002:
HOME=/home/:
LANG=en_US.UTF-8:
YARN_NICENESS=0:
YARN_IDENT_STRING=yarn:
HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce:
=================================================================

>>> Invoking Shell command line now >>

Stdoutput vpd1
Stdoutput ------------------------------------------------------------------------------------------------
Stdoutput vpd2
Stdoutput ------------------------------------------------------------------------------------------------
Stdoutput vpd3
Stdoutput ------------------------------------------------------------------------------------------------
Stdoutput Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Stdoutput Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Stdoutput Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Stdoutput Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Exit code of the Shell command 0
<<< Invocation of Shell command completed <<<


<<< Invocation of Main class completed <<<


Oozie Launcher ends

stderr

-- empty --
Note: Recompile with -Xlint:deprecation for details.

syslog

 

2016-08-19 08:38:26,713 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2016-08-19 08:38:26,714 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2016-08-19 08:38:26,727 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2016-08-19 08:38:26,727 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1471289760180_0031, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@441357d7)
2016-08-19 08:38:26,780 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: RM_DELEGATION_TOKEN, Service: 127.0.0.1:8032, Ident: (owner=abhishek, renewer=oozie mr token, realUser=oozie, issueDate=1471621096236, maxDate=1472225896236, sequenceNumber=92, masterKeyId=5)
2016-08-19 08:38:26,898 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2016-08-19 08:38:27,278 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/abhishek/appcache/application_1471289760180_0031
2016-08-19 08:38:27,861 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2016-08-19 08:38:28,482 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
2016-08-19 08:38:28,502 INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2016-08-19 08:38:28,830 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: org.apache.oozie.action.hadoop.OozieLauncherInputFormat$EmptySplit@181838a7
2016-08-19 08:38:28,841 INFO [main] org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2016-08-19 08:38:28,880 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
2016-08-19 08:38:43,002 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1471289760180_0031_m_000000_0 is done. And is in the process of committing

Logs from YARN

 

Log Type: stderr
Log Upload Time: Fri Aug 19 08:38:51 -0700 2016
Log Length: 2275
Aug 19, 2016 8:38:25 AM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Aug 19, 2016 8:38:25 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
Aug 19, 2016 8:38:25 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Aug 19, 2016 8:38:25 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
Aug 19, 2016 8:38:25 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Aug 19, 2016 8:38:25 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
Aug 19, 2016 8:38:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
Aug 19, 2016 8:38:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Log Type: stdout
Log Upload Time: Fri Aug 19 08:38:51 -0700 2016
Log Length: 0

Log Type: syslog
Log Upload Time: Fri Aug 19 08:38:51 -0700 2016
Log Length: 26243
Showing 4096 bytes of 26243 total. Click here for the full log.
ra:8020/tmp/hadoop-yarn/staging/history/done_intermediate/abhishek/job_1471289760180_0031_conf.xml_tmp
2016-08-19 08:38:44,022 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Copied to done location: hdfs://quickstart.cloudera:8020/tmp/hadoop-yarn/staging/history/done_intermediate/abhishek/job_1471289760180_0031_conf.xml_tmp
2016-08-19 08:38:44,033 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to done: hdfs://quickstart.cloudera:8020/tmp/hadoop-yarn/staging/history/done_intermediate/abhishek/job_1471289760180_0031.summary_tmp to hdfs://quickstart.cloudera:8020/tmp/hadoop-yarn/staging/history/done_intermediate/abhishek/job_1471289760180_0031.summary
2016-08-19 08:38:44,036 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to done: hdfs://quickstart.cloudera:8020/tmp/hadoop-yarn/staging/history/done_intermediate/abhishek/job_1471289760180_0031_conf.xml_tmp to hdfs://quickstart.cloudera:8020/tmp/hadoop-yarn/staging/history/done_intermediate/abhishek/job_1471289760180_0031_conf.xml
2016-08-19 08:38:44,043 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to done: hdfs://quickstart.cloudera:8020/tmp/hadoop-yarn/staging/history/done_intermediate/abhishek/job_1471289760180_0031-1471621096430-abhishek-oozie%3Alauncher%3AT%3Dshell%3AW%3Dtestmultiplescoop%3AA%3Dshell-1471621123619-1-0-SUCCEEDED-root.abhishek-1471621102427.jhist_tmp to hdfs://quickstart.cloudera:8020/tmp/hadoop-yarn/staging/history/done_intermediate/abhishek/job_1471289760180_0031-1471621096430-abhishek-oozie%3Alauncher%3AT%3Dshell%3AW%3Dtestmultiplescoop%3AA%3Dshell-1471621123619-1-0-SUCCEEDED-root.abhishek-1471621102427.jhist
2016-08-19 08:38:44,045 INFO [Thread-71] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped JobHistoryEventHandler. super.stop()
2016-08-19 08:38:44,046 INFO [Thread-71] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1471289760180_0031_m_000000_0
2016-08-19 08:38:44,047 INFO [Thread-71] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : quickstart.cloudera:39926
2016-08-19 08:38:44,121 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1471289760180_0031_m_000000_0 TaskAttempt Transitioned from SUCCESS_FINISHING_CONTAINER to SUCCEEDED
2016-08-19 08:38:44,125 INFO [Thread-71] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to 
2016-08-19 08:38:44,127 INFO [Thread-71] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: History url is http://quickstart.cloudera:19888/jobhistory/job/job_1471289760180_0031
2016-08-19 08:38:44,140 INFO [Thread-71] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Waiting for application to be successfully unregistered.
2016-08-19 08:38:45,142 INFO [Thread-71] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Final Stats: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:0 RackLocal:0
2016-08-19 08:38:45,145 INFO [Thread-71] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Deleting staging directory hdfs://quickstart.cloudera:8020 /tmp/hadoop-yarn/staging/abhishek/.staging/job_1471289760180_0031
2016-08-19 08:38:45,156 INFO [Thread-71] org.apache.hadoop.ipc.Server: Stopping server on 55341
2016-08-19 08:38:45,158 INFO [IPC Server listener on 55341] org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 55341
2016-08-19 08:38:45,163 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2016-08-19 08:38:45,164 INFO [TaskHeartbeatHandler PingChecker] org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler thread interrupted
2016-08-19 08:38:45,165 INFO [Ping Checker] org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: TaskAttemptFinishingMonitor thread interrupted

Contributor
Posts: 47
Registered: 01-05-2016

Re: Shell script to run Sqoop jobs in parallel inside Oozie

I don't see any obvious problems in the logs you posted. Nevertheless, in the "yarn logs" section we can see your container getting killed by the ResourceManager:

 

...
2016-08-19 08:38:44,046 INFO [Thread-71] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1471289760180_0031_m_000000_0
...

This looks a lot like a problem I ran into before. As a matter of fact, it seems that when you use Oozie to schedule a Shell action, the action is executed by the "yarn" user, not by the user logged in to Hue.

 

In my case, this caused a trivial permission problem when it all boiled down to the final write into the destination HDFS directory.

 

Maybe you can try the following:

 

1) Insert a "sleep 100" as the first thing in your shell script, then check whether the workflow's total running time before being killed is 100 seconds longer than without the sleep. If it is, we'll know the shell script is being loaded and executed correctly, and that the problem is not in the way you launch it.

 

2) If (1) is confirmed, try changing the permissions on your destination folder "/user/hive/blah" to 777 (I see you are saving parquet files in that directory) and try again.

 

3) In any case, have a look at this thread; maybe it can inspire a workaround:

 

http://community.cloudera.com/t5/Batch-Processing-and-Workflow/Oozie-Workflow-Shell-Action-getting-k...

 

4) If nothing works, drill down into the logs again with the "Diagnostics" --> "Logs" tool in Cloudera Manager, minimum log level "ERROR", sources "EVERYTHING".
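For step (2), the permission change itself is a one-liner. On a cluster it would be the hdfs command shown in the comment below (the path is the one from the script above). To keep the sketch runnable without HDFS, a local temp directory stands in for the HDFS path:

```shell
#!/bin/bash
# On a real cluster, step (2) would be:
#   hdfs dfs -chmod -R 777 /user/hive/blah
# Here a local temp directory stands in for the HDFS path so the
# sketch can actually run.
dest="$(mktemp -d)/blah"
mkdir -p "$dest"
chmod -R 777 "$dest"
stat -c '%a' "$dest"
```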

 

HTH

Explorer
Posts: 8
Registered: 10-26-2017

Re: Shell script to run Sqoop jobs in parallel inside Oozie

Hi,

 

I had the same problem... After going through the logs, I found that the ojdbc driver could not be found. The simplest solution was to copy ojdbc.jar into the YARN home directory:

 

/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/hadoop-yarn 

on all nodes.
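A sketch of how that copy across nodes might be scripted (the node names are placeholders; the parcel path is the one quoted above). Local directories stand in for the per-node destinations so the sketch runs anywhere; on a real cluster each cp would be an scp or a configuration-management copy:

```shell
#!/bin/bash
# Hypothetical sketch: place ojdbc.jar into the YARN lib directory on
# every node. Local directories stand in for the per-node destination
# /opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/hadoop-yarn;
# on a real cluster, replace the cp with scp (or pdcp/ansible).
workdir="$(mktemp -d)"
echo "fake driver bytes" > "$workdir/ojdbc.jar"

for node in node1 node2 node3; do
    dest="$workdir/$node/lib/hadoop-yarn"
    mkdir -p "$dest"
    cp "$workdir/ojdbc.jar" "$dest/"
done
```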
