Created 04-26-2016 09:22 PM
I have read the Apache Falcon documentation on late-arriving data; however, I need to further understand how it handles data on the target cluster side. If late-arriving data is detected (source cluster), is the data at the target location deleted/refreshed or simply appended?
Created 05-06-2016 03:29 AM
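The supported policies for late data handling are:
backoff: rerun at a fixed delay until the late cut-off is reached.
exp-backoff: rerun at exponentially increasing delays until the late cut-off is reached.
final: rerun once, when the late cut-off is reached.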
For example, a late cut-off of hours(6) means data can be delayed by up to 6 hours:
<late-arrival cut-off="hours(6)"/>
The late input in the following process specification is handled by the /apps/myapp/latehandle workflow:
<late-process policy="exp-backoff" delay="hours(2)">
<late-input input="input" workflow-path="/apps/myapp/latehandle" />
</late-process>
So for up to 6 hours, until the feed arrives, the workflow will be retried. Once the feed arrives within that window, the window is reset.
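To make the retry timing concrete, here is a minimal Python sketch of how an exponential backoff could play out with the values from the snippet above (delay hours(2), cut-off hours(6)). The function name exp_backoff_checks is hypothetical, and the doubling rule is an assumption for illustration, not taken from Falcon's source; check the Falcon documentation for the exact schedule.

```python
from datetime import timedelta

def exp_backoff_checks(delay, cut_off):
    """Return the offsets at which late data would be re-checked.

    Assumption (not from Falcon's source): the wait before each check
    doubles (delay, 2*delay, 4*delay, ...) and no check is scheduled
    after the cut-off.
    """
    offsets = []
    elapsed = timedelta(0)
    wait = delay
    while elapsed + wait <= cut_off:
        elapsed += wait
        offsets.append(elapsed)
        wait *= 2  # double the wait before the next check
    return offsets

checks = exp_backoff_checks(timedelta(hours=2), timedelta(hours=6))
print([int(c.total_seconds() // 3600) for c in checks])  # [2, 6]
```

Under this assumption there are two checks, at 2 hours and at 6 hours; a third would fall past the cut-off and is never scheduled.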
Inside /apps/myapp/latehandle you can put your own logic (it may be a Sqoop, Hive, or shell action, for example). The processing there determines what happens to the late feed. In simple scenarios you can rerun the actual workflow; otherwise you can write a special workflow that handles the dependencies and boundary cases.
Thanks
Created 04-26-2016 10:04 PM
If the input data arrives late but within the cut-off time (defined in the feed), Falcon reruns the instance and updates the output. If the input data arrives after the cut-off time, Falcon does not rerun the instance and instead marks it as timed out.
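The rule above can be sketched as a small decision function. This is only an illustration of the described behaviour; late_data_action and its return labels are hypothetical names, not part of Falcon's API.

```python
from datetime import datetime, timedelta

def late_data_action(expected_time, arrival_time, cut_off):
    """Classify an input arrival relative to the feed's late cut-off.

    Hypothetical helper: mirrors the rule described in the post,
    not Falcon's actual implementation.
    """
    lateness = arrival_time - expected_time
    if lateness <= timedelta(0):
        return "on-time"   # nothing late to handle
    if lateness <= cut_off:
        return "rerun"     # late, but within the cut-off
    return "timeout"       # past the cut-off, no rerun

expected = datetime(2016, 4, 26, 12, 0)
cut_off = timedelta(hours=6)
print(late_data_action(expected, expected + timedelta(hours=3), cut_off))  # rerun
print(late_data_action(expected, expected + timedelta(hours=7), cut_off))  # timeout
```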
Created 04-26-2016 10:07 PM
Whether the output will be deleted/refreshed or simply appended depends on the process defined by the user. Falcon just reruns the process instance with the late-arriving input data.
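As an illustration of why this depends on the process, the sketch below contrasts a rerun that overwrites the instance's output with one that appends to it. It is plain Python with a dict standing in for the target location; the function names and sample records are hypothetical and nothing here is Falcon API.

```python
def rerun_overwrite(target, instance, records):
    """Replace the instance's output with the freshly computed records."""
    target[instance] = list(records)

def rerun_append(target, instance, records):
    """Add the freshly computed records to whatever is already there."""
    target.setdefault(instance, []).extend(records)

instance = "2016-04-26"
# Output of the first run, before record "c" arrived late.
overwrite_target = {instance: ["a", "b"]}
append_target = {instance: ["a", "b"]}

# The rerun processes the full input, now including the late record.
full_input = ["a", "b", "c"]
rerun_overwrite(overwrite_target, instance, full_input)
rerun_append(append_target, instance, full_input)

print(overwrite_target[instance])  # ['a', 'b', 'c']
print(append_target[instance])     # ['a', 'b', 'a', 'b', 'c']
```

If the process appends and the rerun reads the whole input again, earlier records are duplicated, so an overwrite-style (idempotent) process is usually easier to reason about when reruns are expected.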
Created 04-27-2016 08:52 PM