
Here is an example of scheduling an Oozie coordinator based on input data events: the coordinator starts the Oozie workflow only when the input data is available.

In this example the coordinator starts at 2016-04-10, 06:00 GMT and keeps running until 2017-02-26, 23:25 GMT (note the start and end times in the XML file):

  start="2016-04-10T06:00Z" end="2017-02-26T23:25Z" timezone="GMT"

The frequency is 1 day:

  frequency="${coord:days(1)}"

The EL function below resolves to the dataset instance that matches the nominal time of each coordinator action (for the first action this is the same as the start time). Each action therefore waits for input data for its own date, in /user/root/input/YYYYMMDD format:

          <instance>${coord:current(0)}</instance>
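
As a rough illustration of that resolution, the sketch below approximates how the input path could be derived from an action's nominal date for a daily dataset. It is not how Oozie itself computes instances: the function name resolve_instance is hypothetical, GNU date is assumed, and the real ${coord:current(n)} also takes the dataset's initial-instance and frequency into account.

```shell
#!/bin/bash
# Sketch only: approximate the path that ${coord:current(n)} selects
# for a dataset with a one-day frequency. Assumes GNU date.
resolve_instance() {
  local nominal_day="$1"   # nominal date of the coordinator action, e.g. 2016-04-12
  local offset="$2"        # n in coord:current(n): 0 = same day, -1 = previous day
  date -u -d "${nominal_day} ${offset} days" +"/user/root/input/%Y%m%d"
}

resolve_instance 2016-04-12 0    # /user/root/input/20160412
resolve_instance 2016-04-12 -1   # /user/root/input/20160411
```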

Below are the working configuration files.

coordinator.xml:

<coordinator-app name="test"
  frequency="${coord:days(1)}"
  start="2016-04-10T06:00Z" end="2017-02-26T23:25Z" timezone="GMT"
  xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="inputdataset" frequency="${coord:days(1)}"
             initial-instance="2016-04-10T06:00Z" timezone="GMT">
      <uri-template>${nameNode}/user/root/input/${YEAR}${MONTH}${DAY}</uri-template>
      <done-flag></done-flag>
    </dataset>
    <dataset name="outputdataset" frequency="${coord:days(1)}"
             initial-instance="2016-04-10T06:00Z" timezone="GMT">
      <uri-template>${nameNode}/user/root/output/${YEAR}${MONTH}${DAY}</uri-template>
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
      <data-in name="inputevent" dataset="inputdataset">
          <instance>${coord:current(0)}</instance>
      </data-in>
  </input-events>
  <output-events>
      <data-out name="outputevent" dataset="outputdataset">
          <instance>${coord:current(0)}</instance>
      </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('inputevent')}</value>
        </property>
        <property>
          <name>outputDir</name>
          <value>${coord:dataOut('outputevent')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

workflow.xml

<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${myscript}</exec>
            <argument>${inputDir}</argument>
            <argument>${outputDir}</argument>
            <file>${myscriptPath}</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <kill name="fail-output">
        <message>Incorrect output, expected [Hello Oozie] but was [${wf:actionData('shell-node')['my_output']}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

job.properties

nameNode=hdfs://sandbox.hortonworks.com:8020
start=2016-04-12T06:00Z
end=2017-02-26T23:25Z
jobTracker=sandbox.hortonworks.com:8050
queueName=default
examplesRoot=examples
oozie.coord.application.path=${nameNode}/user/root
workflowAppUri=${oozie.coord.application.path}
myscript=myscript.sh
myscriptPath=${oozie.wf.application.path}/myscript.sh

myscript.sh

#!/bin/bash
echo "I'm receiving input as $1" > /tmp/output
echo "I can store my output at $2" >> /tmp/output

How to schedule this?

1. Edit the files above to match your environment.

2. Validate your workflow.xml and coordinator.xml files using the commands below:

#oozie validate workflow.xml
#oozie validate coordinator.xml

3. Upload your script and these XML files to the HDFS locations set as oozie.coord.application.path and workflowAppUri in job.properties.

4. Submit the coordinator using the command below:

oozie job -oozie http://<oozie-server>:11000/oozie -config $local/path/job.properties -run


Note - You will see that some coordinator actions are in the WAITING state. That is because their input data is not yet available on HDFS.
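
Because the datasets above use an empty done-flag, the presence of the dated directory itself is what marks the input as available. A minimal sketch of releasing a waiting action by creating that directory might look like the following. The function name mark_input_ready and the HADOOP_CMD override are hypothetical and exist only so the command can be dry-run; the path follows the uri-template from coordinator.xml.

```shell
#!/bin/bash
# Sketch only: create the dated input directory that a WAITING action expects.
# HADOOP_CMD is a hypothetical override; set it to "echo" to print the
# command instead of running it.
mark_input_ready() {
  local day="$1"                        # date in YYYYMMDD form, e.g. 20160412
  local dir="/user/root/input/${day}"   # matches the inputdataset uri-template
  ${HADOOP_CMD:-hadoop} fs -mkdir -p "$dir"
}

# Usage (dry run):  HADOOP_CMD=echo mark_input_ready 20160412
# Usage (real run): mark_input_ready 20160412
```

Oozie re-checks missing inputs periodically, so the action may take a short while to leave the WAITING state after the directory appears.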


If you check /var/log/oozie.log for the WAITING coordinator actions, you will find entries like these:

2016-04-14 05:54:05,850  INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000038-160408193600784-oozie-oozi-C] ACTION[0000038-160408193600784-oozie-oozi-C@3] [0000038-160408193600784-oozie-oozi-C@3]::ActionInputCheck:: In checkListOfPaths: hdfs://sandbox.hortonworks.com:8020/user/root/input/20160412 is Missing.

[..]

2016-04-14 05:54:15,601  INFO CoordActionInputCheckXCommand:520 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000038-160408193600784-oozie-oozi-C] ACTION[0000038-160408193600784-oozie-oozi-C@4] [0000038-160408193600784-oozie-oozi-C@4]::ActionInputCheck:: In checkListOfPaths: hdfs://sandbox.hortonworks.com:8020/user/root/input/20160413 is Missing. 
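
To see only the paths Oozie is still waiting for, the log entries above can be filtered. The sketch below assumes the message format shown in these two entries ("In checkListOfPaths: <path> is Missing."); the function name missing_inputs is hypothetical.

```shell
#!/bin/bash
# Sketch only: read Oozie log lines on stdin and print each distinct
# HDFS path that the input check reported as missing.
missing_inputs() {
  sed -n 's/.*In checkListOfPaths: \(.*\) is Missing\..*/\1/p' | sort -u
}

# Usage: missing_inputs < /var/log/oozie.log
```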

On HDFS:

[root@sandbox coord]# hadoop fs -ls /user/root/input/
Found 3 items
-rw-r--r--   3 root hdfs          0 2016-04-13 13:16 /user/root/input/20160410
drwxr-xr-x   - root hdfs          0 2016-04-13 13:07 /user/root/input/20160411

Output:

[root@sandbox coord]# cat /tmp/output
I'm receiving input as hdfs://sandbox.hortonworks.com:8020/user/root/input/20160411
I can store my output at hdfs://sandbox.hortonworks.com:8020/user/root/output/20160411
Comments

Hi Kuldeep,

Thanks for the post. I need some help. I am trying to run the above example, checking for the previous day's date in input-events

using <instance>${coord:current(-1)}</instance>

But it fails. When I use <instance>${coord:current(0)}</instance>, it runs successfully.

Here is my Oozie dryrun output. Please help with hints or suggestions.

***coordJob after parsing: ***
<coordinator-app xmlns="uri:oozie:coordinator:0.1" name="my_Scheduler_5f" frequency="1" start="2016-08-17T23:40Z" end="2016-08-19T23:45Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
  <controls>
    <timeout>30</timeout>
  </controls>
  <input-events>
    <data-in name="coordInput_1" dataset="input1">
      <dataset name="input1" frequency="1" initial-instance="2016-08-17T00:00Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
        <uri-template>${nameNode}/myHdfsPath/Finalpath1/${YEAR}${MONTH}${DAY}/00/</uri-template>
        <done-flag>_Complete</done-flag>
      </dataset>
      <instance>${coord:current(-1)}</instance>
    </data-in>
    <data-in name="coordInput_2" dataset="input2">
      <dataset name="input2" frequency="1" initial-instance="2016-08-17T23:00Z" timezone="America/Los_Angeles" freq_timeunit="DAY" end_of_duration="NONE">
        <uri-template>${nameNode}/myHdfsPath/Finalpath2/${YEAR}${MONTH}${DAY}/00/</uri-template>
        <done-flag>_Complete</done-flag>
      </dataset>
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/myHdfsPath/My_POC/wf-app-dir</app-path>
      <configuration>
        <property>
          <name>date</name>
          <value>${coord:formatTime(coord:dateOffset(coord:actualTime(),-1,'DAY'), "yyyyMMdd")}</value>
        </property>
    </workflow>
  </action>
</coordinator-app>
***actions for instance***

Question with full details.

https://community.hortonworks.com/questions/52412/how-to-configure-oozie-coordinator-dataset-for-pre...