Datasets and output/input events: how do YEAR/MONTH/DAY relate to the instance?

Maybe it is obvious, but I was wondering:

When we declare a dataset based on the date ($YEAR/$MONTH/$DAY/data, for example), use it as an output-events, and consume it from an input-events whose "instance" looks at current(0):

Is the dated directory name used directly to check the input event, or is there some kind of database inside Oozie that registers it? In other words, if we don't declare the output-events and simply create the "right" directory ourselves, will it still work?

1 ACCEPTED SOLUTION

Yes, it will.

1. At the ingestion stage, data is generally collected at a minute, hourly, or daily level.

2. To keep data grouped by timestamp, one follows an HDFS path naming convention such as /a/b/b/yyyy/mm/dd.

3. The job that consumes this data for ETL needs to select a range of these paths, such as a week or a month, which is why datasets have YYYY/MM/DD as variable parameters in them.
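To make the idea concrete, here is a rough Python sketch of how a dated path can be derived purely from a timestamp. This is not Oozie's code: the function names (`resolve`, `instance`, `resolve_range`) are hypothetical, and it assumes a daily dataset whose instances line up with the nominal time, which is a simplification of what current(n) really does. The point is only that the path is computed from the date, so a directory created by hand with the right name looks the same to a consumer as one written by a job that declared output-events. One thing to double-check on your side: if I remember the coordinator spec correctly, the dataset's done-flag setting decides whether a marker file (by default _SUCCESS) must be present or whether the directory alone is enough.

```python
from datetime import datetime, timedelta

# Illustrative template only, reusing the path from the answer and the
# ${YEAR}/${MONTH}/${DAY} variables from the question.
URI_TEMPLATE = "/a/b/b/${YEAR}/${MONTH}/${DAY}/data"


def resolve(template, instant):
    """Substitute zero-padded date parts into the template."""
    return (template
            .replace("${YEAR}", f"{instant.year:04d}")
            .replace("${MONTH}", f"{instant.month:02d}")
            .replace("${DAY}", f"{instant.day:02d}"))


def instance(nominal_time, offset, frequency=timedelta(days=1)):
    """Simplified stand-in for current(offset): shift the nominal time
    by `offset` dataset periods (assumes aligned, daily instances)."""
    return nominal_time + offset * frequency


def resolve_range(template, nominal_time, first, last):
    """Paths for instances first..last inclusive, e.g. -6..0 for a week."""
    return [resolve(template, instance(nominal_time, n))
            for n in range(first, last + 1)]


if __name__ == "__main__":
    run_day = datetime(2016, 3, 1)
    # The path a current(0) input would point at on this run
    print(resolve(URI_TEMPLATE, instance(run_day, 0)))
    # The last three days, as a consumer selecting a range would need
    for path in resolve_range(URI_TEMPLATE, run_day, -2, 0):
        print(path)
```

A consumer that needs a week of data would call `resolve_range(URI_TEMPLATE, run_day, -6, 0)` and then check each resulting path on HDFS.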


thanks 🙂