Created 10-04-2017 04:39 PM
Maybe it is obvious, but I was wondering:
When we declare a dataset based on the date ($YEAR/$MONTH/$DAY/data, for example) as an output-events entry, and then use it in an input-events entry whose "instance" watches current(0):
Is the dated directory name used directly to check the input event, or does Oozie keep some kind of internal database that registers it? In other words, if we don't declare the output-events and simply create the "right" directory ourselves, will it still work?
Created 10-07-2017 03:59 PM
Yes, it will.
1. Generally, at the ingestion stage, data is collected at the minute, hourly or daily level.
2. To keep data together by timestamp, one follows an HDFS path naming convention such as /a/b/b/yyyy/mm/dd.
3. The job that consumes this data for ETL needs to select a range of these paths, such as a week or a month, hence datasets have YYYY/MM/DD as the variable parameters in them.
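To illustrate why creating the directory by hand is enough, here is a rough Python sketch of the kind of check involved. It is not Oozie's code. It assumes a daily dataset, uses the local filesystem as a stand-in for HDFS, and the names `URI_TEMPLATE`, `resolve_instance` and `input_ready` are made up for the example. As far as I know, the input event is checked by expanding the URI template for the requested instance and looking for that path (by default for a `_SUCCESS` done-flag file inside it, or just the directory when the done-flag is empty), so nothing declared in output-events is consulted.

```python
from datetime import datetime, timedelta
from pathlib import Path
import tempfile

# Hypothetical template mirroring the question's layout, in Oozie's ${VAR} syntax.
URI_TEMPLATE = "${YEAR}/${MONTH}/${DAY}/data"


def resolve_instance(template: str, nominal: datetime, offset_days: int = 0) -> str:
    """Expand the template for current(offset_days), assuming a daily dataset."""
    instance = nominal + timedelta(days=offset_days)
    return (
        template.replace("${YEAR}", f"{instance:%Y}")
        .replace("${MONTH}", f"{instance:%m}")
        .replace("${DAY}", f"{instance:%d}")
    )


def input_ready(root: Path, relative_path: str, done_flag: str = "_SUCCESS") -> bool:
    """Return True when the resolved path satisfies the dependency.

    With a non-empty done_flag, that file must exist inside the directory;
    with an empty done_flag, the directory itself is enough.
    """
    target = root / relative_path
    if done_flag:
        return (target / done_flag).is_file()
    return target.is_dir()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        path = resolve_instance(URI_TEMPLATE, datetime(2017, 10, 4))
        print(path, input_ready(root, path, done_flag=""))
        # Create the "right" directory by hand; no output-events involved.
        (root / path).mkdir(parents=True)
        print(path, input_ready(root, path, done_flag=""))
```

One thing to watch for: if the dataset keeps the default done-flag, a hand-made directory also needs a `_SUCCESS` file in it before the coordinator action will start.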
Created 10-09-2017 07:24 AM
thanks 🙂