Support Questions
Find answers, ask questions, and share your expertise

How does falcon handle late arriving data on target cluster?

Super Guru

I have read the Apache Falcon late arriving data documentation; however, I need to further understand how it handles data on the target cluster side. If late arriving data is detected on the source cluster, is the data at the target location deleted/refreshed or simply appended?

1 ACCEPTED SOLUTION

Accepted Solutions

@Sunile Manjee

The supported policies for late data handling are:

  • backoff: Take the maximum late cut-off and check at a fixed, specified interval.
  • exp-backoff (default, recommended): Take the maximum late cut-off and check at exponentially increasing intervals.
  • final: Take the maximum late cut-off and check once.

For example, a late cut-off of hours(8) means data can be delayed by up to 8 hours:

<late-arrival cut-off="hours(8)"/>
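
For context, the cut-off is declared on the feed entity. A minimal sketch of where it sits might look like the following — the feed name, frequency, cluster, and paths are all hypothetical, for illustration only:

```xml
<!-- Hypothetical feed entity fragment; names, dates, and paths are illustrative -->
<feed name="inputFeed" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <!-- Data may arrive up to 8 hours late and still trigger late processing -->
    <late-arrival cut-off="hours(8)"/>
    <clusters>
        <cluster name="sourceCluster" type="source">
            <validity start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"/>
            <retention limit="days(30)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/data/input/${YEAR}-${MONTH}-${DAY}"/>
    </locations>
</feed>
```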

The late input in the following process specification is handled by the /apps/myapp/latehandle workflow:

<late-process policy="exp-backoff" delay="hours(2)">

<late-input input="input" workflow-path="/apps/myapp/latehandle" />

</late-process>

So this means the late workflow will be retried for up to 8 hours until the feed arrives. Once the feed arrives within that window, the window is reset.

Now inside /apps/myapp/latehandle you can put your own logic (it may be a Sqoop, Hive, or shell workflow, etc.). The processing there determines what happens to that late feed. For simple scenarios you can rerun the actual workflow; otherwise you can point to a special workflow that handles the dependencies and boundary cases.
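
As one illustration, /apps/myapp/latehandle could be a small Oozie workflow that reprocesses the affected data. This is only a sketch — the workflow name, script, and the parameter passed in are assumptions, not something Falcon mandates:

```xml
<!-- Hypothetical Oozie workflow at /apps/myapp/latehandle/workflow.xml -->
<workflow-app name="late-handle-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="reprocess"/>
    <action name="reprocess">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Re-run the transformation over the late partition; the script
                 decides whether to overwrite (refresh) or append -->
            <script>reprocess.hql</script>
            <param>PARTITION=${partitionDate}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Late handling failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```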

Thanks


5 REPLIES

Explorer

If the input data arrives late but within the cut-off time (defined in the feed), Falcon will rerun the instance and update the output. If the input data arrives after the cut-off time, Falcon will not rerun it but will mark the instance as timed out.

Explorer

Whether the output will be deleted/refreshed or simply appended depends on the process defined by the user. Falcon just reruns the process instance with the late-arriving input data.
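
Tying that together: the refresh-vs-append decision lives entirely in the workflow the process points at. A hypothetical process entity fragment pairing the input with a late-process policy (all names illustrative) could look like:

```xml
<!-- Hypothetical process entity fragment; names are illustrative -->
<process name="myProcess" xmlns="uri:falcon:process:0.1">
    <inputs>
        <input name="input" feed="inputFeed" start="today(0,0)" end="today(23,59)"/>
    </inputs>
    <workflow engine="oozie" path="/apps/myapp/workflow"/>
    <!-- On late arrival, retry per exp-backoff and run the late workflow;
         that workflow decides whether to delete/refresh or append -->
    <late-process policy="exp-backoff" delay="hours(2)">
        <late-input input="input" workflow-path="/apps/myapp/latehandle"/>
    </late-process>
</process>
```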

New Contributor
Whether the output will be deleted/refreshed or simply appended depends on the process defined by the user. Falcon just reruns the process instance with the late-arriving input data. Nice.

New Contributor
Hortonworks awesome thanks
