NiFi best practices for error handling

Expert Contributor

Hi All,

I would appreciate it if you could point me to best practices for error handling in NiFi. Below is how I envision handling errors in my workflows. Would you suggest any enhancements or better ways to do it?

My error handling requirements are simple: log the errored FlowFiles to the file system and send an alert. All the processors in the dataflow that have a "failure" relationship would send their failed FlowFiles to a funnel, and from there they would go to an error handling Process Group, which does the logging and alerting.

Thanks

11307-example-of-error-handling.png

1 ACCEPTED SOLUTION

Master Mentor
@Raj B

Not all errors are equal. I would avoid lumping all failure relationships into the same error handling strategy. Some errors are no surprise: they can be expected to occur on occasion and may be a one-time thing that resolves itself.

Let's use your example above...

The PutHDFS processor is likely to experience some failures over time due to events outside of NiFi's control. For example, let's say a file is in the middle of transferring to HDFS when the network connection is lost. NiFi would in turn route that FlowFile to failure. If that failure relationship had been routed back to PutHDFS, the transfer would likely have succeeded on a subsequent attempt. A better error handling strategy in this case may be to build a simple error handling flow that can be used when the type of failure might resolve itself.

11306-screen-shot-2017-01-10-at-42513-pm.png

So here you see failed FlowFiles enter at "data". They are then checked for a failure counter; if one does not exist, it is created and set to 1. If it exists, it is incremented by 1. The retry count check will continue to pass the file to "retry" until the same file has been seen x number of times. "Retry" would be routed back to the source processor of the failure. After x attempts the counter is reset, an email is sent, and the file is placed in some local error directory for manual intervention.

https://cwiki.apache.org/confluence/download/attachments/57904847/Retry_Count_Loop.xml?version=1&mod...
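
The counter logic of the retry flow described above can be sketched outside NiFi as plain Python. This is only an illustration of the decision being made, not the contents of the linked template: the attribute name `retry.count`, the limit of 3, and the relationship name `exhausted` are all assumptions, and the sketch assumes a file is handed off (rather than retried) on the x-th time it is seen.

```python
from typing import Dict, Tuple

MAX_SIGHTINGS = 3             # hypothetical value for the "x" in the post
COUNTER_ATTR = "retry.count"  # hypothetical attribute name


def route_failed_flowfile(attributes: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Decide where one failed FlowFile goes, based on its attributes.

    Returns the relationship name and a new attribute map.
    """
    updated = dict(attributes)
    # Create the counter at 1 if it is missing, otherwise increment it by 1.
    count = int(updated.get(COUNTER_ATTR, "0")) + 1
    if count < MAX_SIGHTINGS:
        updated[COUNTER_ATTR] = str(count)
        return "retry", updated  # goes back to the processor that failed
    # Limit reached: reset the counter, then email and write to an error directory.
    updated.pop(COUNTER_ATTR, None)
    return "exhausted", updated


if __name__ == "__main__":
    attrs: Dict[str, str] = {"filename": "example.txt"}
    for _ in range(MAX_SIGHTINGS):
        relationship, attrs = route_failed_flowfile(attrs)
        print(relationship, attrs)
```

Keeping the counter on the FlowFile itself means the retry flow needs no external state, which is why the same small flow can be reused behind any processor whose failures tend to clear up on their own.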

The other scenario is where the type of failure is not likely to ever correct itself. Your MergeContent processor is a good example here. If the processor failed to merge some FlowFiles, it is extremely likely to happen again, so there is little benefit in looping this failure relationship back on the processor like we did above. In this case you may want to route this processor's failure to a PutEmail processor to notify the end user of the failure and where it occurred in the dataflow. The success of the PutEmail processor may just feed another processor, such as UpdateAttribute, which is in a stopped/disabled state. This will hold the data in the dataflow until manual intervention can be taken to identify the issue and either reroute the data back into the flow once corrected or discard the data. If there is concern over available space in your NiFi content repository, I would add a processor to write the data out to a different error file location using PutFile, PutHDFS, PutSFTP, etc.
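
The distinction between the two scenarios can be written down as a small policy table. The sketch below is only a way to reason about which failure relationships deserve a retry loop; the set `SELF_RESOLVING`, the strategy names, and the notice wording are hypothetical and would need to reflect your own flow.

```python
from enum import Enum


class Strategy(Enum):
    RETRY = "retry"                      # loop the FlowFile back to the failing processor
    NOTIFY_AND_HOLD = "notify_and_hold"  # email someone, then park the data


# Hypothetical classification based on the two examples in this thread:
# PutHDFS failures may clear up on their own, MergeContent failures likely will not.
SELF_RESOLVING = {"PutHDFS"}


def choose_strategy(processor_name: str) -> Strategy:
    """Pick an error handling strategy for a processor's failure relationship."""
    if processor_name in SELF_RESOLVING:
        return Strategy.RETRY
    return Strategy.NOTIFY_AND_HOLD


def failure_notice(processor_name: str, filename: str) -> str:
    """Build a notification that says where in the dataflow the failure occurred."""
    return f"FlowFile '{filename}' failed at {processor_name}; held for manual review."


if __name__ == "__main__":
    for name in ("PutHDFS", "MergeContent"):
        print(name, choose_strategy(name).value)
    print(failure_notice("MergeContent", "example.txt"))
```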

Hope this helps,

Matt

10 REPLIES

New Contributor

In my experience, the retry loop pattern has led to a deadlock when the number of failed FlowFiles approaches the size of the queues entering and exiting the loop. The issue is that both of those queues fill up, and NiFi is not able to pick up on the fact that it could cycle FlowFiles between the queues. This also has the unfortunate side effect of blocking any FlowFiles from entering the initial error-prone processor, since they could fail and would then have to enter a full queue. The best solution I have as of now is to simply duplicate any failure-prone processors, considering I usually only want to retry 1-3 times.

See:

http://apache-nifi-users-list.2361937.n4.nabble.com/Back-pressure-deadlock-td3274.html
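
The deadlock described above can be reproduced with a toy model. This is a simplified sketch, not NiFi code: it assumes a processor does not run while its outbound connection is at its back pressure threshold, and the class and function names are made up for the illustration.

```python
from collections import deque


class Connection:
    """A toy queue with an object-count back pressure threshold."""

    def __init__(self, threshold: int, initial: int = 0):
        self.threshold = threshold
        self.items = deque(range(initial))

    def is_full(self) -> bool:
        return len(self.items) >= self.threshold


def step(source: Connection, destination: Connection) -> bool:
    """Move one item unless the source is empty or the destination is full."""
    if not source.items or destination.is_full():
        return False
    destination.items.append(source.items.popleft())
    return True


def run_loop(failure_q: Connection, retry_q: Connection, rounds: int = 10) -> int:
    """Count how many moves happen when every retried file fails again."""
    moves = 0
    for _ in range(rounds):
        # The error-prone processor takes from retry_q and fails into failure_q.
        a = step(retry_q, failure_q)
        # The retry flow takes from failure_q and sends the file back via retry_q.
        b = step(failure_q, retry_q)
        if not (a or b):
            break  # neither side can run: the loop is stuck
        moves += int(a) + int(b)
    return moves


if __name__ == "__main__":
    stuck = run_loop(Connection(2, initial=2), Connection(2, initial=2))
    flowing = run_loop(Connection(2, initial=2), Connection(2, initial=1))
    print("moves with both queues full:", stuck)
    print("moves with one free slot:", flowing)
```

With both queues at their threshold neither step can run, so nothing ever moves; a single free slot is enough to keep files cycling. Duplicating the failure-prone processor avoids the problem because the retries form a straight line instead of a loop.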