Created on 04-14-2016 06:02 AM - edited 08-17-2019 12:47 PM
Here is an example of scheduling an Oozie coordinator based on input data events. It starts the Oozie workflow when the input data is available.
In this example the coordinator starts at 2016-04-10 06:00 GMT and keeps running until 2017-02-26 23:25 GMT (note the start and end times in the XML file):
start="2016-04-10T06:00Z" end="2017-02-26T23:25Z" timezone="GMT"
Frequency is 1 day
frequency="${coord:days(1)}"
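To make the schedule concrete, here is a minimal Python sketch (not part of Oozie) that lists the nominal times a daily coordinator would materialize between the start and end times above. The helper name `nominal_times` is hypothetical, and the sketch assumes a fixed 24-hour step in GMT and an exclusive end time; that fits this example but ignores the daylight-saving adjustments Oozie applies for other timezones.

```python
from datetime import datetime, timedelta, timezone


def nominal_times(start, end, step=timedelta(days=1)):
    """Yield nominal times from start (inclusive) up to end (exclusive)."""
    current = start
    while current < end:
        yield current
        current += step


# Start and end values copied from coordinator.xml in this post.
START = datetime(2016, 4, 10, 6, 0, tzinfo=timezone.utc)
END = datetime(2017, 2, 26, 23, 25, tzinfo=timezone.utc)

times = list(nominal_times(START, END))
print("first action:", times[0].isoformat())
print("last action: ", times[-1].isoformat())
print("total actions:", len(times))
```

Each of these nominal times corresponds to one coordinator action, and each action waits for its own input instance.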
The EL function below resolves to the dataset instance for the current action's nominal time (for the first action this equals the start time), so each action waits for input data for its own date under /user/root/input/YYYYMMDD:
<instance>${coord:current(0)}</instance>
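As a rough illustration of what coord:current(n) means for this dataset, the Python sketch below maps an action's nominal time and an offset n to the directory Oozie would check. It is only a model of the behaviour, not Oozie code: the function name `resolve_current` is hypothetical, and it assumes a daily dataset in GMT with the initial-instance and uri-template used in this post.

```python
from datetime import datetime, timedelta, timezone

# Values copied from coordinator.xml and job.properties in this post.
NAME_NODE = "hdfs://sandbox.hortonworks.com:8020"
URI_TEMPLATE = "{nameNode}/user/root/input/{YEAR}{MONTH}{DAY}"
INITIAL_INSTANCE = datetime(2016, 4, 10, 6, 0, tzinfo=timezone.utc)
DATASET_STEP = timedelta(days=1)


def resolve_current(nominal_time, n):
    """Return the URI of the n-th dataset instance relative to nominal_time."""
    # Index of the latest dataset instance at or before the nominal time.
    base_index = (nominal_time - INITIAL_INSTANCE) // DATASET_STEP
    instance = INITIAL_INSTANCE + (base_index + n) * DATASET_STEP
    return URI_TEMPLATE.format(
        nameNode=NAME_NODE,
        YEAR=f"{instance.year:04d}",
        MONTH=f"{instance.month:02d}",
        DAY=f"{instance.day:02d}",
    )


# Third action of this coordinator (nominal time 2016-04-12T06:00Z).
third = datetime(2016, 4, 12, 6, 0, tzinfo=timezone.utc)
print(resolve_current(third, 0))   # hdfs://sandbox.hortonworks.com:8020/user/root/input/20160412
print(resolve_current(third, -1))  # hdfs://sandbox.hortonworks.com:8020/user/root/input/20160411
```

The first result is the same /user/root/input/20160412 path that the log further down reports as missing for action @3.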
Below are the working configuration files.
coordinator.xml:
<coordinator-app name="test"
frequency="${coord:days(1)}"
start="2016-04-10T06:00Z" end="2017-02-26T23:25Z" timezone="GMT"
xmlns="uri:oozie:coordinator:0.2">
<datasets>
<dataset name="inputdataset" frequency="${coord:days(1)}"
initial-instance="2016-04-10T06:00Z" timezone="GMT">
<uri-template>${nameNode}/user/root/input/${YEAR}${MONTH}${DAY}</uri-template>
<done-flag></done-flag>
</dataset>
<dataset name="outputdataset" frequency="${coord:days(1)}"
initial-instance="2016-04-10T06:00Z" timezone="GMT">
<uri-template>${nameNode}/user/root/output/${YEAR}${MONTH}${DAY}</uri-template>
<done-flag></done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="inputevent" dataset="inputdataset">
<instance>${coord:current(0)}</instance>
</data-in>
</input-events>
<output-events>
<data-out name="outputevent" dataset="outputdataset">
<instance>${coord:current(0)}</instance>
</data-out>
</output-events>
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>inputDir</name>
<value>${coord:dataIn('inputevent')}</value>
</property>
<property>
<name>outputDir</name>
<value>${coord:dataOut('outputevent')}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
workflow.xml:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${myscript}</exec>
<argument>${inputDir}</argument>
<argument>${outputDir}</argument>
<file>${myscriptPath}</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<kill name="fail-output">
<message>Incorrect output, expected [Hello Oozie] but was [${wf:actionData('shell-node')['my_output']}]</message>
</kill>
<end name="end"/>
</workflow-app>
job.properties:
nameNode=hdfs://sandbox.hortonworks.com:8020
start=2016-04-12T06:00Z
end=2017-02-26T23:25Z
jobTracker=sandbox.hortonworks.com:8050
queueName=default
examplesRoot=examples
oozie.coord.application.path=${nameNode}/user/root
workflowAppUri=${oozie.coord.application.path}
myscript=myscript.sh
myscriptPath=${oozie.wf.application.path}/myscript.sh
myscript.sh:
#!/bin/bash
echo "I'm receiving input as $1" > /tmp/output
echo "I can store my output at $2" >> /tmp/output
How to schedule this?
1. Edit the files above to match your environment.
2. Validate your workflow.xml and coordinator.xml files using the commands below:
#oozie validate workflow.xml
#oozie validate coordinator.xml
3. Upload your script and these XML files to the HDFS location set as oozie.coord.application.path and workflowAppUri in job.properties.
4. Submit the coordinator using the command below:
oozie job -oozie http://<oozie-server>:11000/oozie -config $local/path/job.properties -run
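If you want to script the steps above, a small Python wrapper along these lines could work. This is a sketch under assumptions: the function names are hypothetical, the upload uses `hadoop fs -put -f` (the post only says "upload"), and the Oozie URL keeps the `<oozie-server>` placeholder from the command above. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_commands(oozie_url, app_dir, local_files, properties_file):
    """Return the CLI invocations for validating, uploading and submitting."""
    return [
        ["oozie", "validate", "workflow.xml"],
        ["oozie", "validate", "coordinator.xml"],
        # Assumption: overwrite any existing copies in the application path.
        ["hadoop", "fs", "-put", "-f", *local_files, app_dir],
        ["oozie", "job", "-oozie", oozie_url, "-config", properties_file, "-run"],
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it and stop at the first failure."""
    for command in commands:
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)


COMMANDS = build_commands(
    oozie_url="http://<oozie-server>:11000/oozie",
    app_dir="/user/root",  # oozie.coord.application.path without ${nameNode}
    local_files=["coordinator.xml", "workflow.xml", "myscript.sh"],
    properties_file="job.properties",
)
run_all(COMMANDS)  # dry run: prints the commands instead of executing them
```

Calling `run_all(COMMANDS, dry_run=False)` on a host with the Hadoop and Oozie clients installed would execute the same commands in order.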
Note: some coordinator actions will be in WAITING state because their input data is not yet available on HDFS.
If you check /var/log/oozie.log and grep for the WAITING coordinator actions, you will see entries like these:
2016-04-14 05:54:05,850 INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000038-160408193600784-oozie-oozi-C] ACTION[0000038-160408193600784-oozie-oozi-C@3] [0000038-160408193600784-oozie-oozi-C@3]::ActionInputCheck:: In checkListOfPaths: hdfs://sandbox.hortonworks.com:8020/user/root/input/20160412 is Missing.
[..]
2016-04-14 05:54:15,601 INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000038-160408193600784-oozie-oozi-C] ACTION[0000038-160408193600784-oozie-oozi-C@4] [0000038-160408193600784-oozie-oozi-C@4]::ActionInputCheck:: In checkListOfPaths: hdfs://sandbox.hortonworks.com:8020/user/root/input/20160413 is Missing.
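When many actions are waiting, it can help to pull the action ID and the missing path out of these log lines. The Python sketch below does that with a regular expression written against the two sample lines above; the pattern and the function name `missing_inputs` are assumptions based on this log format and may need adjusting for other Oozie versions.

```python
import re

# Matches the "is Missing" lines written by CoordActionInputCheckXCommand.
MISSING_INPUT = re.compile(
    r"ACTION\[(?P<action>[^\]]+)\].*?In checkListOfPaths: (?P<path>\S+) is Missing"
)


def missing_inputs(log_lines):
    """Return (action id, missing path) pairs found in the given log lines."""
    found = []
    for line in log_lines:
        match = MISSING_INPUT.search(line)
        if match:
            found.append((match.group("action"), match.group("path")))
    return found


# First sample line from this post, used here as test input.
SAMPLE_LINES = [
    "2016-04-14 05:54:05,850 INFO CoordActionInputCheckXCommand:520 - "
    "SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] "
    "JOB[0000038-160408193600784-oozie-oozi-C] "
    "ACTION[0000038-160408193600784-oozie-oozi-C@3] "
    "[0000038-160408193600784-oozie-oozi-C@3]::ActionInputCheck:: "
    "In checkListOfPaths: "
    "hdfs://sandbox.hortonworks.com:8020/user/root/input/20160412 is Missing.",
    "a line that does not describe a missing input",
]

for action, path in missing_inputs(SAMPLE_LINES):
    print(action, "is waiting for", path)
```

To run it against a real log, pass an open file object instead of `SAMPLE_LINES`, since the function only iterates over lines.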
On HDFS:
[root@sandbox coord]# hadoop fs -ls /user/root/input/
Found 3 items
-rw-r--r-- 3 root hdfs 0 2016-04-13 13:16 /user/root/input/20160410
drwxr-xr-x - root hdfs 0 2016-04-13 13:07 /user/root/input/20160411
Output:
[root@sandbox coord]# cat /tmp/output
I'm receiving input as hdfs://sandbox.hortonworks.com:8020/user/root/input/20160411
I can store my output at hdfs://sandbox.hortonworks.com:8020/user/root/output/20160411
Created on 08-19-2016 08:46 AM
Hi Kuldeep,
Thanks for the post. I need some help. I am trying to run the above example, checking for the previous day's date in input-events
using <instance>${coord:current(-1)}</instance>, but it is failing. When I use <instance>${coord:current(0)}</instance>, it runs successfully.
Here is my Oozie dry-run output. Please help with hints or suggestions.
***coordJob after parsing: ***
<coordinator-app xmlns="uri:oozie:coordinator:0.1" name="my_Scheduler_5f" frequency="1" start="2016-08-17T23:40Z" end="2016-08-19T23:45Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
<controls>
<timeout>30</timeout>
</controls>
<input-events>
<data-in name="coordInput_1" dataset="input1">
<dataset name="input1" frequency="1" initial-instance="2016-08-17T00:00Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
<uri-template>${nameNode}/myHdfsPath/Finalpath1/${YEAR}${MONTH}${DAY}/00/</uri-template>
<done-flag>_Complete</done-flag>
</dataset>
<instance>${coord:current(-1)}</instance>
</data-in>
<data-in name="coordInput_2" dataset="input2">
<dataset name="input2" frequency="1" initial-instance="2016-08-17T23:00Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
<uri-template>${nameNode}/myHdfsPath/Finalpath2/${YEAR}${MONTH}${DAY}/00/</uri-template>
<done-flag>_Complete</done-flag>
</dataset>
<instance>${coord:current(-1)}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${nameNode}/myHdfsPath/My_POC/wf-app-dir</app-path>
<configuration>
<property>
<name>date</name>
<value>${coord:formatTime(coord:dateOffset(coord:actualTime(),-1,'DAY'), "yyyyMMdd")}</value>
</property>
</workflow>
</action>
</coordinator-app>
***actions for instance***
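One thing that may be worth checking in this dry-run output, offered as a sketch and not a confirmed diagnosis: for the first action, coord:current(-1) points to a dataset instance earlier than the dataset's initial-instance, and the Oozie coordinator specification says such out-of-bounds instances are silently ignored. The Python below models that check for dataset input1 using the values from the output above; the function name `current_instance` is hypothetical and the sketch assumes fixed 24-hour steps in UTC.

```python
from datetime import datetime, timedelta, timezone


def current_instance(initial_instance, nominal_time, n, step=timedelta(days=1)):
    """Return the n-th dataset instance relative to nominal_time, or None
    when it would fall before the dataset's initial-instance."""
    # Index of the latest dataset instance at or before the nominal time.
    base_index = (nominal_time - initial_instance) // step
    index = base_index + n
    if index < 0:
        return None
    return initial_instance + index * step


# Values copied from the dry-run output above (dataset "input1").
INITIAL = datetime(2016, 8, 17, 0, 0, tzinfo=timezone.utc)
FIRST_NOMINAL = datetime(2016, 8, 17, 23, 40, tzinfo=timezone.utc)
SECOND_NOMINAL = FIRST_NOMINAL + timedelta(days=1)

print("first action, current(0): ", current_instance(INITIAL, FIRST_NOMINAL, 0))
print("first action, current(-1):", current_instance(INITIAL, FIRST_NOMINAL, -1))
print("second action, current(-1):", current_instance(INITIAL, SECOND_NOMINAL, -1))
```

In this model only the first action lacks a previous-day instance; from the second action onward current(-1) lands on or after the initial-instance. Moving the dataset's initial-instance at least one day before the coordinator start is one way to test whether this is the cause.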
Question with full details.