Created on 04-16-2017 10:03 AM - edited 09-16-2022 04:28 AM
I have a shell script in HDFS, and I have scheduled it in Oozie with the following workflow.
Workflow:
<workflow-app name="Shell_test" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-8f63"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-8f63">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>${input_file}</argument>
            <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
            <file>/user/xxxx/args/${input_file}#${input_file}</file>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
job properties
nameNode=xxxxxxxxxxxxxxxxxxxx
jobTracker=xxxxxxxxxxxxxxxxxxxxxxxx
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/xxxxxxx/xxxxxx
args file
tableA
tableB
tablec
tableD
Now the shell script runs for a single table from the args file. This workflow executes successfully without any errors.
How can I schedule this shell script to run in parallel? I want the script to run for 10 tables at the same time.
What steps are needed to do so? What changes should I make to the workflow?
Should I create 10 workflows to run 10 parallel jobs, or is there a better way to handle this?
Created 04-17-2017 01:38 PM
@mbigelow The args file contains all the tables. I want to pass 10 arguments from that same file.
So for the workflow, I believe I have to pass only one args file. How can I pass 10 arguments from a single file to the same workflow so that the script runs in parallel?
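One way to do that without creating N copies of the workflow is a small driver script that reads the args file and backgrounds one run per table, capped at 10 at a time. A minimal, POSIX-sh-compatible sketch; the `work` function here is a self-contained stand-in for `./shell.sh "$table"`, and the `/tmp` file names are illustrative, not from the thread:

```shell
MAX_JOBS=10

# Stand-in for: ./shell.sh "$1"
work() { echo "done $1" >> /tmp/parallel_demo.log; }

# Stand-in args file, one table per line
printf 'tableA\ntableB\ntableC\ntableD\n' > /tmp/parallel_demo.args
: > /tmp/parallel_demo.log

i=0
while read -r table; do
  work "$table" &                 # launch this table's run in the background
  i=$((i + 1))
  # after every MAX_JOBS launches, wait for the whole batch to finish
  [ $((i % MAX_JOBS)) -eq 0 ] && wait
done < /tmp/parallel_demo.args
wait                              # drain the final partial batch
```

This batch-and-wait approach is simple but conservative: a new batch only starts once the slowest job of the previous batch has finished.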
Created 04-17-2017 02:46 PM
@mbigelow Here is my shell script
#!/bin/bash

LOG_LOCATION=/home/$USER/logs
exec 2>&1

[ $# -ne 1 ] && { echo "Usage : $0 table"; exit 1; }
table=$1

TIMESTAMP=`date "+%Y-%m-%d"`
touch /home/$USER/logs/${TIMESTAMP}.success_log
touch /home/$USER/logs/${TIMESTAMP}.fail_log
success_logs=/home/$USER/logs/${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${TIMESTAMP}.fail_log

# Function to log the status of the job
function log_status {
    status=$1
    message=$2
    if [ "$status" -ne 0 ]; then
        echo "`date +\"%Y-%m-%d %H:%M:%S\"` [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
        #echo "Please find the attached log file for more details"
        exit 1
    else
        echo "`date +\"%Y-%m-%d %H:%M:%S\"` [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
    fi
}

# Note: the original wrapped this hive command in backticks, which would try to
# execute hive's stdout as a command; run it directly instead.
hive -e "create table testing.${table} stored as parquet as select * from fishing.${table}"
g_STATUS=$?
log_status $g_STATUS "Hive create ${table}"

echo "***********************************************************************************************************************************************************************"
Created 04-19-2017 07:05 AM
@mbigelow Hi, did you get a chance to look at my problem?
Created 04-17-2017 03:27 PM
I have another script, bigshell.sh, that invokes shell.sh and runs it 10 times in parallel.
Now when I run this script in Oozie, it fails.
Contents of bigshell.sh:
nl -n rz test | xargs -n 2 --max-procs 10 sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'
When I change --max-procs from 10 to 1, the script is successful:
nl -n rz test | xargs -n 2 --max-procs 1 sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'
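For comparison, here is a self-contained demo of the same xargs fan-out mechanics (the tables file, stub worker, and /tmp paths are stand-ins, not from the thread). Two details that can differ inside an Oozie launcher container are called out in comments: the worker is invoked by explicit path rather than relying on PATH, and the log directory is created before anything redirects into it:

```shell
mkdir -p /tmp/xargs_demo/logs      # create the log dir up front; the original
                                   # redirects into a dated dir that may not
                                   # exist on the node running the action
printf 'tableA\ntableB\ntableC\n' > /tmp/xargs_demo/tables

# Stub worker standing in for shell.sh
cat > /tmp/xargs_demo/worker.sh <<'EOF'
#!/bin/sh
echo "processing $1"
EOF
chmod +x /tmp/xargs_demo/worker.sh

# Same shape as bigshell.sh: nl pairs each table with a line number,
# xargs -n 2 hands both tokens to sh ($0 = number, $1 = table), and
# --max-procs caps how many copies run at once.
nl -n rz /tmp/xargs_demo/tables | xargs -n 2 --max-procs 3 \
  sh -c '/tmp/xargs_demo/worker.sh "$1" > /tmp/xargs_demo/logs/"$1"'
```

Note that `--max-procs` is the GNU xargs spelling; the portable short form is `-P`.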
Created 04-18-2017 08:10 AM
Oozie has a control structure, named "Fork Join", for running multiple Actions in parallel. It looks like exactly what you need (provided the number of Actions is fixed and the arguments are hard-coded in the Workflow).
For an example, see section 5.0, "Fork-Join controls", in that "Hooked for Hadoop" tutorial:
http://hadooped.blogspot.com/2013/07/apache-oozie-part-9a-coordinator-jobs.html
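A minimal fork/join sketch along those lines, with two hard-coded tables (the workflow name, action names, table names, and file path are illustrative, adapted from the workflow earlier in this thread):

```xml
<workflow-app name="Shell_parallel" xmlns="uri:oozie:workflow:0.5">
    <start to="fork-tables"/>
    <fork name="fork-tables">
        <path start="shell-tableA"/>
        <path start="shell-tableB"/>
    </fork>
    <action name="shell-tableA">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>tableA</argument>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        </shell>
        <ok to="join-tables"/>
        <error to="Kill"/>
    </action>
    <action name="shell-tableB">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>shell.sh</exec>
            <argument>tableB</argument>
            <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        </shell>
        <ok to="join-tables"/>
        <error to="Kill"/>
    </action>
    <!-- every forked path must end at the same join -->
    <join name="join-tables" to="End"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="End"/>
</workflow-app>
```

For 10 tables you would add one path plus one action per table; the join fires only after all forked actions succeed.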