Created on 04-14-2016 06:02 AM - edited 08-17-2019 12:47 PM
Here is an example of scheduling an Oozie coordinator based on input data events: it starts the Oozie workflow when the input data is available.
In this example, the coordinator starts at 2016-04-10 06:00 GMT and keeps running until 2017-02-26 23:25 GMT (note the start and end times in the XML file):
start="2016-04-10T06:00Z" end="2017-02-26T23:25Z" timezone="GMT"
Frequency is 1 day
frequency="${coord:days(1)}"
The EL function below resolves to the dataset instance that matches the coordinator action's nominal time (for the first action, this is the same as the start time). Each action therefore waits for input data for its own date under /user/root/input, in YYYYMMDD format.
<instance>${coord:current(0)}</instance>
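To make the mapping concrete, here is a rough sketch of which path a given offset points at for a daily dataset. The function name `resolve_daily_instance` is made up for illustration and is not an Oozie command. It assumes GNU date, UTC, and one instance per day, and it ignores the dataset's initial-instance, which Oozie itself does take into account.

```shell
#!/bin/bash
# Hypothetical helper (not an Oozie command): print the path a daily dataset
# instance maps to. Assumes GNU date, UTC, and one instance per day.
resolve_daily_instance() {
  local nominal_day="$1"  # date part of the action's nominal time, e.g. 2016-04-12
  local offset="$2"       # n in coord:current(n): 0 = same day, -1 = previous day
  local base="$3"         # directory that holds the YYYYMMDD instances
  local epoch
  epoch=$(date -u -d "$nominal_day" +%s) || return 1
  # UTC has no daylight-saving shifts, so one day is always 86400 seconds.
  epoch=$((epoch + offset * 86400))
  echo "${base}/$(date -u -d "@${epoch}" +%Y%m%d)"
}

resolve_daily_instance 2016-04-12 0 /user/root/input    # prints /user/root/input/20160412
resolve_daily_instance 2016-04-12 -1 /user/root/input   # prints /user/root/input/20160411
```

With an offset of 0, the action for 2016-04-12 waits for the 20160412 directory, which matches the paths reported in the Oozie log further down.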
Below are the working configuration files.
coordinator.xml:
<coordinator-app name="test" frequency="${coord:days(1)}" start="2016-04-10T06:00Z" end="2017-02-26T23:25Z" timezone="GMT" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="inputdataset" frequency="${coord:days(1)}" initial-instance="2016-04-10T06:00Z" timezone="GMT">
      <uri-template>${nameNode}/user/root/input/${YEAR}${MONTH}${DAY}</uri-template>
      <done-flag></done-flag>
    </dataset>
    <dataset name="outputdataset" frequency="${coord:days(1)}" initial-instance="2016-04-10T06:00Z" timezone="GMT">
      <uri-template>${nameNode}/user/root/output/${YEAR}${MONTH}${DAY}</uri-template>
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputevent" dataset="inputdataset">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <output-events>
    <data-out name="outputevent" dataset="outputdataset">
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('inputevent')}</value>
        </property>
        <property>
          <name>outputDir</name>
          <value>${coord:dataOut('outputevent')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
  <start to="shell-node"/>
  <action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
      </configuration>
      <exec>${myscript}</exec>
      <argument>${inputDir}</argument>
      <argument>${outputDir}</argument>
      <file>${myscriptPath}</file>
      <capture-output/>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-output">
    <message>Incorrect output, expected [Hello Oozie] but was [${wf:actionData('shell-node')['my_output']}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
job.properties
nameNode=hdfs://sandbox.hortonworks.com:8020
start=2016-04-12T06:00Z
end=2017-02-26T23:25Z
jobTracker=sandbox.hortonworks.com:8050
queueName=default
examplesRoot=examples
oozie.coord.application.path=${nameNode}/user/root
workflowAppUri=${oozie.coord.application.path}
myscript=myscript.sh
myscriptPath=${oozie.wf.application.path}/myscript.sh
myscript.sh
#!/bin/bash
echo "I'm receiving input as $1" > /tmp/output
echo "I can store my output at $2" >> /tmp/output
How to schedule this?
1. Edit the above files as per your environment.
2. Validate your workflow.xml and coordinator.xml files using the commands below:
#oozie validate workflow.xml
#oozie validate coordinator.xml
3. Upload your script and these XML files to the oozie.coord.application.path and workflowAppUri locations given in job.properties.
4. Submit the coordinator using the command below:
oozie job -oozie http://<oozie-server>:11000/oozie -config $local/path/job.properties -run
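Before submitting, it can help to confirm that job.properties still defines every key the files above rely on after you have edited it for your environment. The following is only a sketch: `check_job_properties` is a hypothetical helper, not part of Oozie, and the key list is simply taken from the job.properties shown in this post.

```shell
#!/bin/bash
# Hypothetical pre-submit check (not part of Oozie): verifies that a properties
# file defines every key that the example files in this post rely on.
check_job_properties() {
  local file="$1"
  local missing=0
  local key
  for key in nameNode jobTracker queueName oozie.coord.application.path \
             workflowAppUri myscript myscriptPath; do
    # A key counts as defined when the text before the first "=" equals it exactly.
    if ! awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$file"; then
      echo "missing property: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_job_properties job.properties` returns 0 when all keys are present, and otherwise lists the missing ones on stderr and returns 1.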
Note: some coordinator actions will be in the WAITING state because they are still waiting for their input data to become available on HDFS.
If you check /var/log/oozie.log and grep for the WAITING coordinator actions, you will see entries like these:
2016-04-14 05:54:05,850 INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000038-160408193600784-oozie-oozi-C] ACTION[0000038-160408193600784-oozie-oozi-C@3] [0000038-160408193600784-oozie-oozi-C@3]::ActionInputCheck:: In checkListOfPaths: hdfs://sandbox.hortonworks.com:8020/user/root/input/20160412 is Missing.
[..]
2016-04-14 05:54:15,601 INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000038-160408193600784-oozie-oozi-C] ACTION[0000038-160408193600784-oozie-oozi-C@4] [0000038-160408193600784-oozie-oozi-C@4]::ActionInputCheck:: In checkListOfPaths: hdfs://sandbox.hortonworks.com:8020/user/root/input/20160413 is Missing.
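If the log is long, a small filter can pull out just the paths the input check is still waiting for. This is a sketch that assumes the log lines look like the ones above, ending in "<path> is Missing."; `list_missing_inputs` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read Oozie log text on stdin and print each distinct
# HDFS path that an input check reported as missing.
list_missing_inputs() {
  # Keep only the "hdfs://... is Missing" fragment, strip the suffix, de-duplicate.
  grep -o 'hdfs://[^ ]* is Missing' | sed 's/ is Missing$//' | sort -u
}
```

For example, `list_missing_inputs < /var/log/oozie.log` prints one path per line. Because the done-flag in coordinator.xml is empty, the existence of the directory itself marks the dataset instance as ready, so creating a listed path (for instance with `hadoop fs -mkdir -p <path>`) once its data has landed should let the waiting action proceed.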
On HDFS:
[root@sandbox coord]# hadoop fs -ls /user/root/input/
Found 3 items
-rw-r--r-- 3 root hdfs 0 2016-04-13 13:16 /user/root/input/20160410
drwxr-xr-x - root hdfs 0 2016-04-13 13:07 /user/root/input/20160411
Output:
[root@sandbox coord]# cat /tmp/output
I'm receiving input as hdfs://sandbox.hortonworks.com:8020/user/root/input/20160411
I can store my output at hdfs://sandbox.hortonworks.com:8020/user/root/output/20160411
Created on 08-19-2016 08:46 AM
Hi Kuldeep,
Thanks for the post. I need some help. I am trying to run the above example while checking for the previous day's date in input-events,
using <instance>${coord:current(-1)}</instance>
but it fails. When I use <instance>${coord:current(0)}</instance>, it runs successfully.
Here is my Oozie dry-run output. Please help with hints or suggestions.
***coordJob after parsing: ***
<coordinator-app xmlns="uri:oozie:coordinator:0.1" name="my_Scheduler_5f" frequency="1" start="2016-08-17T23:40Z" end="2016-08-19T23:45Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
  <controls>
    <timeout>30</timeout>
  </controls>
  <input-events>
    <data-in name="coordInput_1" dataset="input1">
      <dataset name="input1" frequency="1" initial-instance="2016-08-17T00:00Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
        <uri-template>${nameNode}/myHdfsPath/Finalpath1/${YEAR}${MONTH}${DAY}/00/</uri-template>
        <done-flag>_Complete</done-flag>
      </dataset>
      <instance>${coord:current(-1)}</instance>
    </data-in>
    <data-in name="coordInput_2" dataset="input2">
      <dataset name="input2" frequency="1" initial-instance="2016-08-17T23:00Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
        <uri-template>${nameNode}/myHdfsPath/Finalpath2/${YEAR}${MONTH}${DAY}/00/</uri-template>
        <done-flag>_Complete</done-flag>
      </dataset>
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/myHdfsPath/My_POC/wf-app-dir</app-path>
      <configuration>
        <property>
          <name>date</name>
          <value>${coord:formatTime(coord:dateOffset(coord:actualTime(),-1,'DAY'), "yyyyMMdd")}</value>
        </property>
    </workflow>
  </action>
</coordinator-app>
***actions for instance***
Question with full details.