03-06-2017 04:41 PM
I have a shell script in HDFS that I would like to schedule in Oozie. The script takes its input from another source file.
I pass table names as arguments to this sqoop_import.sh script. There are 1500 tables, so I want to execute the imports in parallel.
`Tables`
```
123_abc
234_cde
enf_7yui
```
and so on.
Using the Linux `split` command, I split the file containing the 1500 table names into 20 smaller files, so that I can execute 20 jobs in parallel.
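For reference, the split itself is just the following (the file names here are examples):

```bash
# 1500 names / 20 files = 75 lines per file; the output files are named
# tables_part_aa, tables_part_ab, ..., tables_part_at
split -l 75 tables.txt tables_part_
```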
The scripts are as follows:
`sqoop_import.sh`
```bash
#!/bin/bash
# This script imports tables from MySQL to HDFS.
source /home/$USER/mysql/source.sh

[ $# -ne 1 ] && { echo "Usage: $0 <table>"; exit 1; }
table=$1

TIMESTAMP=$(date "+%Y-%m-%d")
success_logs=/home/$USER/logs/${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${TIMESTAMP}.fail_log
touch "${success_logs}" "${failed_logs}"

# Function to log the exit status of the Sqoop job.
function log_status
{
    status=$1
    message=$2
    if [ "$status" -ne 0 ]; then
        echo "$(date +"%Y-%m-%d %H:%M:%S") [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
        #echo "Please find the attached log file for more details"
        exit 1
    else
        echo "$(date +"%Y-%m-%d %H:%M:%S") [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
    fi
}

sqoop import \
    -D mapreduce.map.memory.mb=3584 \
    -D mapreduce.map.java.opts=-Xmx2868m \
    --connect ${domain}:${port}/${database} \
    --username ${username} --password ${password} \
    --query "select * from ${table} where \$CONDITIONS" \
    -m 1 \
    --as-parquetfile \
    --hive-import --hive-database ${hivedatabase} --hive-table ${table} \
    --map-column-java Date=String \
    --target-dir /user/hive/warehouse/${hivedatabase}.db/${table} \
    --outdir /home/$USER/logs/outdir

g_STATUS=$?
log_status $g_STATUS "SQOOP import ${table}"
echo "****************************************************************"
```
Here is the `source.sh` file:

```bash
domain=jdbc:mysql://XXXXXXXXX
port=3306
database=testing
username=xxxxxx
password=xxxxxxx
hivedatabase=testing
```
Now I want to schedule this job in Oozie. I gave the shell script path in workflow.xml along with all the job properties.
I am confused about how to pass the 20 sets of arguments to the same script so that the workflow executes them in parallel.
I also want the source.sh file contents to be available to the shell script.
This setup works fine as a Linux cron job; I just want to use Oozie from now on.
I would appreciate an explanation along with any answers.
03-21-2017 03:34 AM
You'd be better off creating Sqoop actions instead: generate a workflow.xml that contains several Sqoop actions with parallel execution (using fork and join nodes) inside a single Oozie workflow.
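A minimal sketch of that structure, showing only two of the imports (the action names, `${...}` properties, and schema versions here are illustrative; the real workflow would be generated from your table list):

```xml
<workflow-app name="sqoop-parallel-import" xmlns="uri:oozie:workflow:0.4">
    <start to="fork-imports"/>

    <!-- Every path under the fork starts one Sqoop action; Oozie runs them in parallel. -->
    <fork name="fork-imports">
        <path start="import-123_abc"/>
        <path start="import-234_cde"/>
        <!-- ...one path per table, or per batch of tables... -->
    </fork>

    <action name="import-123_abc">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- One <arg> element per token avoids the whitespace
                 splitting that a single <command> string is subject to. -->
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>${domain}:${port}/${database}</arg>
            <arg>--username</arg>
            <arg>${username}</arg>
            <arg>--password</arg>
            <arg>${password}</arg>
            <arg>--table</arg>
            <arg>123_abc</arg>
            <arg>--hive-import</arg>
            <arg>--hive-database</arg>
            <arg>${hivedatabase}</arg>
            <arg>--hive-table</arg>
            <arg>123_abc</arg>
            <arg>-m</arg>
            <arg>1</arg>
        </sqoop>
        <ok to="join-imports"/>
        <error to="fail"/>
    </action>

    <!-- import-234_cde would be identical apart from the table name. -->

    <join name="join-imports" to="end"/>

    <kill name="fail">
        <message>Sqoop import failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The connection values that currently live in source.sh would move into the workflow's job.properties file, so Oozie itself can substitute the `${domain}`, `${username}`, etc. parameters.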