
Schedule a shell script to run in parallel in Oozie

Contributor

I have a shell script in HDFS. I have scheduled this script in oozie with the following workflow.

 

Workflow:

 

<workflow-app name="Shell_test" xmlns="uri:oozie:workflow:0.5">
<start to="shell-8f63"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="shell-8f63">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>shell.sh</exec>
<argument>${input_file}</argument>
<env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
<file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
<file>/user/xxxx/args/${input_file}#${input_file}</file>
</shell>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

job properties

 

nameNode=xxxxxxxxxxxxxxxxxxxx
jobTracker=xxxxxxxxxxxxxxxxxxxxxxxx
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/xxxxxxx/xxxxxx


args file

 

tableA
tableB
tablec
tableD

At the moment the shell script runs for a single table from the args file, and the workflow executes successfully without any errors.

How can I schedule this shell script to run in parallel? I want the script to run for 10 tables at the same time.

What steps are needed to do so, and what changes should I make to the workflow?

Should I create 10 workflows to run 10 parallel jobs, or is there a better way to handle this?

11 REPLIES

Champion
I am not 100% sure this covers what you are asking, but I think it does.

Look into Oozie coordinators and bundles. Coordinators are typically used to trigger workflows based on conditions, but you could use one to launch different versions of your workflow (i.e. 10 different tables, each passed as an argument) and have them run at the same time.

https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html

Contributor
What I am looking for is this: in a Linux shell we can run the same script for 10 different arguments at the same time.

Can I do the same in Oozie? When I call a new shell script that executes my script for 10 arguments in parallel on Linux, it works fine.

But when I try to do the same in Oozie, the job fails.

Why is this happening? What changes should I make to my workflow?

Champion
Are those ten different files passed as arguments? And do these files contain any number of tables?

Can you post the logs or the error that Oozie hit when trying to run the second version?

I think the first issue would be that you need to provide each argument file to the workflow so that it can be passed to the container. I see that you have this for one file, but did you change it or add the second args file? Changing it for each run is not fun. Do you need to pass the arguments from a file, or can they be passed to the script directly? That would get around the need to provide each argument file to each different workflow. I still think that a coordinator is what you are looking for. I haven't worked with one in a while, but from my recollection you can reference an existing workflow multiple times and give each reference its own conditions or schedule.

Contributor

@mbigelow The args file contains all the tables. I want to pass 10 arguments from the same file.

So for the workflow, I believe I have to pass only one args file. How can I pass 10 arguments from a single file to the same workflow to run the script in parallel?

Champion
Ah, OK. To help I would need to see the script, or at least a snippet of it, and the error it is hitting. I can't think of any issue with doing this in Oozie. I could put something together to test this, but it would be more straightforward if I knew exactly how it is running in parallel outside of Oozie.

Passing the file with the args, parsing it, and running whatever command in parallel are all contained within the script.

Let me try to run through this. Say your script takes in a file containing just words. You parse it and echo each word to a file. The writing to a file is run in parallel in the script. Oozie will launch a container on a worker node. The script and the file will both be made available to the job that runs in the container. The script launches and runs. Let's assume 10 words; then 10 processes will be forked and run. The gist here is that Oozie doesn't do anything to make it run in parallel; that still depends on the script.

What I was thinking previously is that you wanted to run the script/workflow itself in parallel, not a command within the script. And parallel here isn't taking advantage of the cluster for parallelism. It would run just as fast locally on a host, without going through Oozie.
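The word-file walk-through above could look roughly like the sketch below. It is only an illustration under stated assumptions: `words.txt`, the `out/` directory, and the ten sample words are made up, and a temporary directory stands in for the container's working directory. The point it shows is that the forking and waiting happen entirely inside the script.

```shell
#!/bin/bash
# Hypothetical stand-ins: words.txt plays the role of the args file,
# out/ is where each forked process writes its result.
workdir=$(mktemp -d)
printf '%s\n' alpha bravo charlie delta echo foxtrot golf hotel india juliet \
    > "$workdir/words.txt"
mkdir -p "$workdir/out"

# Fork one background process per word; the parallelism lives in the script,
# not in whatever launched it.
while IFS= read -r word || [ -n "$word" ]; do
    [ -n "$word" ] || continue
    echo "$word" > "$workdir/out/$word.txt" &
done < "$workdir/words.txt"

# Block until every forked process has finished before the script exits.
wait

# Count the files that were written.
ls "$workdir/out" | wc -l
```

Run under Oozie, a script like this would still occupy a single container on a single worker node, which is why it gains nothing from the cluster.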

Contributor

@mbigelow Here is my shell script

 

#!/bin/bash
# Create testing.<table> as a Parquet copy of fishing.<table> and log the outcome.
LOG_LOCATION=/home/$USER/logs
exec 2>&1   # send stderr to stdout so both streams are captured together

[ $# -ne 1 ] && { echo "Usage : $0 table "; exit 1; }

table=$1

TIMESTAMP=$(date "+%Y-%m-%d")
success_logs=${LOG_LOCATION}/${TIMESTAMP}.success_log
failed_logs=${LOG_LOCATION}/${TIMESTAMP}.fail_log
touch "${success_logs}" "${failed_logs}"

# Log the outcome of a step; exit with status 1 if the step failed
function log_status
{
    status=$1
    message=$2
    if [ "$status" -ne 0 ]; then
        echo "$(date +"%Y-%m-%d %H:%M:%S") [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
        #echo "Please find the attached log file for more details"
        exit 1
    else
        echo "$(date +"%Y-%m-%d %H:%M:%S") [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
    fi
}

# Run hive directly; wrapping the call in backticks would try to execute hive's output as a command
hive -e "create table testing.${table} stored as parquet as select * from fishing.${table}"

g_STATUS=$?
log_status $g_STATUS "Hive create ${table}"

echo "***********************************************************************************************************************************************************************"

Contributor

@mbigelow Hi did you get a chance to look at my problem?

Contributor

@mbigelow  

I have another script, bigshell.sh, that executes shell.sh 10 times in parallel.

When I run this script in Oozie, it fails.

Contents of bigshell.sh:

 

nl -n rz  test | xargs -n 2 --max-procs 10 sh -c 'shell.sh "$1"  > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'

 

When I change max-procs to 1 instead of 10, the script is successful.

 

nl -n rz  test | xargs -n 2 --max-procs 1 sh -c 'shell.sh "$1"  > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'
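For reference, here is a self-contained sketch of how the pipeline above hands its arguments over: `nl -n rz` prefixes each table with a zero-padded line number, `xargs -n 2` passes the number and the table name together, and inside `sh -c` the number lands in `$0` while the table name lands in `$1`. Everything else in the sketch is hypothetical: `stub.sh` stands in for shell.sh, a temporary directory replaces `/tmp/logging`, and `-P` is the short form of `--max-procs`. The `mkdir -p` is there because a redirect into a directory that does not exist fails; whether that is what happens on the Oozie worker node is an assumption worth checking against the action's logs, not a diagnosis.

```shell
#!/bin/bash
workdir=$(mktemp -d)
printf '%s\n' tableA tableB tablec tableD > "$workdir/test"

# Hypothetical stub standing in for shell.sh
cat > "$workdir/stub.sh" <<'EOF'
#!/bin/bash
echo "processing $1"
EOF
chmod +x "$workdir/stub.sh"

# Create the dated log directory up front so the redirect cannot fail on it.
logdir="$workdir/logging/$(date "+%Y-%m-%d")"
mkdir -p "$logdir"
export workdir logdir

# Each sh -c invocation receives: $0 = line number, $1 = table name.
nl -n rz "$workdir/test" \
  | xargs -n 2 -P 10 sh -c '"$workdir/stub.sh" "$1" > "$logdir/$1" 2>&1'
rc=$?

# xargs exits non-zero if any of the parallel invocations failed.
echo "xargs exit status: $rc"
```

Capturing the exit status of `xargs` matters under Oozie, since the shell action is judged by the exit code of the script it launched.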

Explorer

Oozie has a control structure called "Fork Join" that runs multiple actions in parallel. It looks like exactly what you need (provided the number of actions is fixed and the arguments are hard-coded in the workflow).

For an example, look at the "Hooked for Hadoop" tutorial, section 5.0, "Fork-Join controls":

http://hadooped.blogspot.com/2013/07/apache-oozie-part-9a-coordinator-jobs.html
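Because the fork's paths have to be written out in the workflow, one way to avoid hand-editing is to generate the workflow from the args file. The sketch below is an assumption-laden illustration, not a tested Oozie deployment: the function name `gen_workflow` and the node names `fork-tables`, `join-tables`, and `shell-<table>` are made up, and each action is assumed to pass one table name directly as its `<argument>` instead of shipping the args file. The remaining elements are copied from the workflow in the question.

```shell
#!/bin/bash
# Emit a fork/join workflow with one shell action per table in the args file.
# Table names are assumed to be plain identifiers (no "/" or "&"), since they
# are substituted with sed.
gen_workflow() {
    local args_file=$1 t
    cat <<'EOF'
<workflow-app name="Shell_test" xmlns="uri:oozie:workflow:0.5">
<start to="fork-tables"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<fork name="fork-tables">
EOF
    # One <path> per table: these are the branches the fork starts in parallel.
    while IFS= read -r t || [ -n "$t" ]; do
        [ -n "$t" ] || continue
        printf '<path start="shell-%s"/>\n' "$t"
    done < "$args_file"
    echo '</fork>'
    # One shell action per table; every branch meets again at the join.
    while IFS= read -r t || [ -n "$t" ]; do
        [ -n "$t" ] || continue
        sed "s/@TABLE@/$t/g" <<'EOF'
<action name="shell-@TABLE@">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>shell.sh</exec>
<argument>@TABLE@</argument>
<env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
<file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
</shell>
<ok to="join-tables"/>
<error to="Kill"/>
</action>
EOF
    done < "$args_file"
    cat <<'EOF'
<join name="join-tables" to="End"/>
<end name="End"/>
</workflow-app>
EOF
}

# Usage: build a workflow for the four tables from the question.
args=$(mktemp)
wf_xml=$(mktemp)
printf '%s\n' tableA tableB tablec tableD > "$args"
gen_workflow "$args" > "$wf_xml"
grep -c '<path start=' "$wf_xml"
```

The generated file is still a static workflow, so it would need to be regenerated and re-uploaded whenever the table list changes. Unlike the xargs approach, each table here gets its own launcher and container, so the work can spread across the cluster.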