Support Questions


oozie Input events clarification

Expert Contributor

I am new to Oozie. Can anyone explain the purpose of input-events and output-events? I have gone through the manual, but it is still not clear. Can anyone explain the input-event in the case below?

<coordinator-app name="MY_APP" frequency="1440" start="2009-02-01T00:00Z" end="2009-02-07T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
   <datasets>
      <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
         <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      </dataset>
   </datasets>
   <input-events>
      <data-in name="coordInput1" dataset="input1">
          <start-instance>${coord:current(-23)}</start-instance>
          <end-instance>${coord:current(0)}</end-instance>
      </data-in>
   </input-events>
   <action>
      <workflow>
         <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
      </workflow>
   </action>     
</coordinator-app>
1 ACCEPTED SOLUTION


@vamsi valiveti

Oozie can materialize coordinator actions (i.e., start tasks/jobs) based on time intervals or triggers, for example, run Job X every day at 12pm. However, time is not always the only dependency. Sometimes we may want to start a job only after all the necessary data is available. So, the Oozie coordinator allows us to use both time and data dependencies to start a workflow.

“dataset”, “input-events” and “output-events” are the pillars for configuring data dependencies in coordinator.xml.

Dataset:

A “dataset” is essentially an entity that represents data produced by an application and is often defined by its directory location. When the data is ready to be consumed, a file named “_SUCCESS” is added to the folder by default. Alternatively, we can specify a different marker file name instead of “_SUCCESS” by setting the “done-flag”.
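As a sketch, the dataset above could declare a custom ready-marker (the file name “_DONE” here is just a hypothetical choice) via the “done-flag” element:

```xml
<dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
   <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
   <!-- wait for a file named _DONE in each instance folder instead of the default _SUCCESS -->
   <done-flag>_DONE</done-flag>
</dataset>
```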

If we look at the config example you have above, we expect a new dataset instance to be generated every 60 minutes, in the folder “tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}”:

<dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">         
	<uri-template>
		hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}
   	</uri-template>      
</dataset>

Input-Event:

An “input-event” describes the data instances that must exist before a job starts. For example, when a coordinator runs, the input-event can be used to check whether a “_SUCCESS” file has been posted in the last hour, and process the data for that period. If nothing matches this criterion, or all the data is more than an hour old, the job is not executed. Another example is to wait until several files/data instances are complete before running.

We specify the time window to look for these data instances using the “start-instance” and the “end-instance”.

So, in the example you have above, it specifies that we should process files for the last 24 hours (“current(0)” being the current hour's instance and “current(-23)” being the instance from 23 hours ago):

<input-events>      
	<data-in name="coordInput1" dataset="input1">          
		<start-instance>${coord:current(-23)}</start-instance>          
		<end-instance>${coord:current(0)}</end-instance>      
	</data-in>   
</input-events>
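As an illustration of what that window resolves to (plain Python, not Oozie itself; the nominal time is a hypothetical example value):

```python
from datetime import datetime, timedelta

# Hypothetical nominal time of one coordinator action.
nominal = datetime(2009, 2, 1, 0, 0)

# current(-23) .. current(0): 24 hourly instances of dataset "input1".
instances = [nominal + timedelta(hours=n) for n in range(-23, 1)]

# Resolve each instance against the uri-template of the dataset.
uris = [
    "hdfs://localhost:9000/tmp/revenue_feed/"
    f"{t.year:04d}/{t.month:02d}/{t.day:02d}/{t.hour:02d}"
    for t in instances
]

print(len(uris))   # 24 instances
print(uris[0])     # oldest: 23 hours before the nominal time
print(uris[-1])    # newest: the nominal time itself
```

Oozie waits until every one of those 24 folders is “ready” (contains the done-flag file) before it runs the workflow.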

Output-Event:

An “output-event” is the opposite of an “input-event”: it describes the output the job writes once it completes. That output can then serve as the “input-event” for another coordinator.


5 REPLIES


Expert Contributor

Hi @Eyad Garelnabi

Thanks for the input. Regarding current(0), I have one clarification. When I check the Oozie textbook, the formula below is given:

current(n) = dsII + dsF * (n + (caNT - dsII) / dsF)

current(0) = 2014-10-06T06:00Z + 3 days * (0 + (2014-10-19T06:00Z - 2014-10-06T06:00Z) / 3 days)
           = 2014-10-06T06:00Z + 3 days * (13 / 3)
           = 2014-10-06T06:00Z + 13 days
           = 2014-10-19T06:00Z
But the textbook (page 127) gives 2014-10-18T06:00Z, and I am not sure what I am missing.


In your calculation, the initial instance (dsII) is 2014-10-06T06:00Z, the frequency (dsF) is 3 days, and the coordinator's nominal time (caNT) is 2014-10-19T06:00Z.

Using that information, you'll have data instances at 2014-10-06T06:00Z, 2014-10-09T06:00Z, 2014-10-12T06:00Z, 2014-10-15T06:00Z, and 2014-10-18T06:00Z. The next data instance would occur at 2014-10-21T06:00Z, which is after caNT.

So, the last usable data instance occurs at 2014-10-18T06:00Z.
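The same enumeration can be sketched in plain Python (independent of Oozie) to confirm which instance current(0) refers to:

```python
from datetime import datetime, timedelta

dsII = datetime(2014, 10, 6, 6, 0)    # initial-instance
dsF = timedelta(days=3)               # dataset frequency
caNT = datetime(2014, 10, 19, 6, 0)   # coordinator action nominal time

# Materialize dataset instances from dsII up to (not past) caNT.
instances = []
t = dsII
while t <= caNT:
    instances.append(t)
    t += dsF

current0 = instances[-1]              # last instance at or before caNT
print(current0)                       # 2014-10-18 06:00:00
```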

Expert Contributor

Hi @Eyad Garelnabi

Thanks, that answers my question, but the Oozie textbook says we can calculate current(0) using the formula current(0) = dsII + dsF * (0 + (caNT - dsII) / dsF).

What is the problem with my calculation, since I am not able to get 2014-10-18T06:00Z with that formula?


You are not doing anything wrong, and neither is the book. The limitation is with the formula itself.

This formula does not account for scenarios where (caNT - dsII) / dsF produces a fraction. The division is really an integer (floor) division, so the fractional part must be discarded: floor(13 / 3) = 4, giving current(0) = dsII + 3 days * 4 = 2014-10-18T06:00Z. Taken literally, without that step, the formula will not match current(0) without eyeballing it.

If you take a look at the textbook, it says: “Notably, the nominal time 2014-10-19T06:00Z and current(0) do not exactly match in this example”.
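A quick check in plain Python (a sketch, not part of Oozie) showing that flooring the division makes the formula agree with the book's 2014-10-18T06:00Z:

```python
from datetime import datetime, timedelta

dsII = datetime(2014, 10, 6, 6, 0)    # initial-instance
dsF = timedelta(days=3)               # dataset frequency
caNT = datetime(2014, 10, 19, 6, 0)   # coordinator action nominal time

# (caNT - dsII) / dsF = 13/3 = 4.33...; the formula needs the floor.
k = (caNT - dsII) // dsF              # floor division of timedeltas -> 4

current0 = dsII + dsF * (0 + k)
print(current0)                       # 2014-10-18 06:00:00
```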