Created 12-15-2016 02:14 PM
I am new to Oozie. Can anyone explain the purpose of input-events and output-events? I have gone through the manual, but it is still not clear. Can anyone explain the input-event in the case below?
<coordinator-app name="MY_APP" frequency="1440" start="2009-02-01T00:00Z" end="2009-02-07T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <start-instance>${coord:current(-23)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>
Created 12-16-2016 06:56 AM
Oozie can materialize coordinator actions (i.e. start tasks/jobs) based on time intervals or triggers. For example, run Job X every day at 12 PM. However, time is not always the only dependency. Sometimes we may want to start a job only after all the necessary data is available. So the Oozie coordinator allows us to use both time and data dependencies to start a workflow.
“dataset”, “input-events” and “output-events” are the pillars for configuring data dependencies in coordinator.xml.
A “dataset” is essentially an entity that represents data produced by an application and is often defined using its directory location. When data is ready to be consumed, a file named “_SUCCESS” is added to the folder by default. Alternatively, we can specify the name of the file we want to look for instead of “_SUCCESS” by setting the “done-flag”.
Looking at the config example you posted, a new instance of the dataset is expected every 60 minutes, and its folder will be “/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}”.
<dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
  <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
</dataset>
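To make the done-flag concrete, here is a sketch of the same dataset with the flag set explicitly. The file name “_READY” is a made-up example, not something from your configuration; if the element is left out, Oozie looks for “_SUCCESS”.

```xml
<dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
  <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
  <!-- Hypothetical flag file name; omit this element to fall back to the default _SUCCESS -->
  <done-flag>_READY</done-flag>
</dataset>
```

As far as I know, an empty done-flag element makes Oozie treat the mere existence of the directory as the signal that the data is ready.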
An “input-event” describes the instances of data that must exist before a job can start. For example, when a coordinator runs, the input-event can be used to check whether a “_SUCCESS” file has been posted in the last hour, and process the data for that hour. If nothing matches these criteria, the job is not started; the coordinator action waits for the data until it arrives or the action times out. Another example is to wait until several files/data instances have been completed before running.
We specify the time window in which to look for these data instances by using the “start-instance” and the “end-instance”.
So the example you posted says we should process the instances for the last 24 hours (“current(0)” being the current hour and “current(-23)” being 23 hours ago).
<input-events>
  <data-in name="coordInput1" dataset="input1">
    <start-instance>${coord:current(-23)}</start-instance>
    <end-instance>${coord:current(0)}</end-instance>
  </data-in>
</input-events>
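The matched instances only become useful once they are handed to the workflow. A common way to do that, sketched below, is a configuration property that uses the coord:dataIn EL function. The property name “inputDirs” is my own placeholder; your workflow would have to read a property with that name.

```xml
<action>
  <workflow>
    <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    <configuration>
      <property>
        <!-- "inputDirs" is a hypothetical name; use whatever your workflow expects -->
        <name>inputDirs</name>
        <!-- Resolves to the URIs of the instances matched by the data-in named coordInput1 -->
        <value>${coord:dataIn('coordInput1')}</value>
      </property>
    </configuration>
  </workflow>
</action>
```

With the window from your example, the value should resolve to a comma-separated list of the 24 hourly folders.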
An “output-event” is the opposite of an “input-event”. It describes the output the job writes once it completes. This output can be used as the “input-event” for another coordinator.
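Your example has no output-events, so here is a sketch of what one could look like. The dataset name “output1”, the data-out name “coordOutput1” and the “revenue_daily” path are all made up for illustration.

```xml
<datasets>
  <!-- Hypothetical output dataset: one instance per day, matching the coordinator frequency of 1440 minutes -->
  <dataset name="output1" frequency="1440" initial-instance="2009-02-01T00:00Z" timezone="UTC">
    <uri-template>hdfs://localhost:9000/tmp/revenue_daily/${YEAR}/${MONTH}/${DAY}</uri-template>
  </dataset>
</datasets>
<output-events>
  <data-out name="coordOutput1" dataset="output1">
    <!-- The single instance this run is expected to produce -->
    <instance>${coord:current(0)}</instance>
  </data-out>
</output-events>
```

In a real coordinator.xml the output dataset would sit in the same datasets element as “input1”, and the output-events block would follow input-events. The workflow can pick up the resolved path with ${coord:dataOut('coordOutput1')}.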
Created 12-16-2016 12:21 PM
Thanks for the input. I need one clarification regarding current(0). When I check the Oozie textbook, the formula below is given:
current(n) = dsII + dsF * (n + (caNT – dsII) / dsF)
current(0) = dsII + dsF * (0 + (caNT – dsII) / dsF)
= 2014-10-06T06:00Z + 3 days * (0 + (2014-10-19T06:00Z – 2014-10-06T06:00Z) / 3 days)
= 2014-10-06T06:00Z + 3 days * (13 days / 3 days)
= 2014-10-06T06:00Z + 13 days
= 2014-10-19T06:00Z
But when I check page 127 of the textbook, they give 2014-10-18T06:00Z. I am not sure what I am missing.
Created 12-16-2016 03:10 PM
In your calculation, the initial instance (dsII) is 2014-10-06T06:00Z, the frequency (dsF) is 3 days, and the coordinator's nominal time (caNT) is 2014-10-19T06:00Z.
Using that information, you'll have data instances for 2014-10-06T06:00Z, 2014-10-09T06:00Z, 2014-10-12T06:00Z, 2014-10-15T06:00Z, 2014-10-18T06:00Z. The next data instance will occur at 2014-10-21T06:00Z which is after the caNT.
So, the last "usable" data instance will occur at 2014-10-18T06:00Z, and that is the value of current(0).
Created 12-16-2016 03:19 PM
Thanks, that answers my question, but the Oozie textbook says we can calculate current(0) using the formula current(0) = dsII + dsF * (0 + (caNT – dsII) / dsF).
What is the problem with my calculation? I am not able to get 2014-10-18T06:00Z with the formula.
Created 12-16-2016 03:33 PM
You are not doing anything wrong and neither is the book. The limitation is with the formula itself.
Read with ordinary division, this formula does not account for cases where (caNT – dsII) / dsF is not a whole number. In such cases current(0) does not equal caNT, and plugging the numbers in will not give you the right instance unless you round the quotient down or list the instances by hand.
If you take a look at the textbook, it says: “Notably, the nominal time 2014-10-19T06:00Z and current(0) do not exactly match in this example.”
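For what it's worth, the arithmetic does work out if the division is treated as integer division, which is how I understand the Oozie coordinator specification to define it (it writes the operator as "div"). A worked version under that assumption, where the floor brackets are my notation rather than the book's:

```latex
\begin{aligned}
\text{current}(n) &= \text{dsII} + \text{dsF} \times \left( n + \left\lfloor \frac{\text{caNT} - \text{dsII}}{\text{dsF}} \right\rfloor \right) \\
\text{current}(0) &= \text{2014-10-06T06:00Z} + 3\ \text{days} \times \left( 0 + \left\lfloor \frac{13\ \text{days}}{3\ \text{days}} \right\rfloor \right) \\
&= \text{2014-10-06T06:00Z} + 3\ \text{days} \times 4 \\
&= \text{2014-10-06T06:00Z} + 12\ \text{days} \\
&= \text{2014-10-18T06:00Z}
\end{aligned}
```

Dropping the remainder (13 / 3 gives 4, not 4.33) is what moves the result back from the nominal time to the last instance that actually exists.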