Support Questions

Find answers, ask questions, and share your expertise

Questions on data recovery/integrity after standalone NiFi instance crash

avatar
Expert Contributor

Hello,

I have couple of questions on data recovery/integrity after a standlaone NiFi instance crashes (and is subsequently restarted); I searched here in HCC about NiFi fault-tolerance, the few posts that I looked at talk about fault-tolerance and recovery in a cluster environment, but I'm interested in a standalone NiFi scenario.

In our case the standalone NiFi instance crashed because the Java heap ran out of space (but we do have enough heap space allocated); at the time of the crash one dataflow was running a real-time stream without any issues, but when a second dataflow, that had ListSFTP and FetchSFTP processors, started running (there is a known issue with running List and Fetch processors when there are lots of files to process - https://issues.apache.org/jira/browse/NIFI-3423), some of the NiFi processors in the first dataflow started to throw out-of-memory errors and the ListenTCP processor stopped ingesting new flowfiles from the source system; and, on the server the Java process CPU utilization was something like 250%; at that point, we stopped both dataflows (the NiFi canvas was still accessible, it did not crash) and restarted the NiFi instance; after that, we resumed the first dataflow and all was fine;

Since NiFi writes flowfile content to content repository and keeps attributes and state info in the flowfile repository, no data should have been lost or corrupted when NiFi instance crashed. Just wanted to clarify if my understanding in this scenario (no loss of data or integrity issues) is correct.

2nd question is, during a planned server reboot, if we stop a dataflow when there is data in transit in the dataflow (i.e. some flowfiles are in queues between processors, some flowfiles are being processed by NiFi processors, e.g. replacetext), then restart the NiFi instance and resume the stopped dataflow, would NiFi pickup where it was left off? i.e, flowfiles that were in the middle of being processed (just prior to the dataflow being stopped) resume their processing from where they were left off? my understanding is yes, that's how NiFi works, the fault tolerance built into NiFi takes care of that. Would like to know if that is correct ?

Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Master Guru

I believe your understanding is correct...

The flow file repository is like a transaction log that provides the state of where every flow file is in the flow. The session object provided to processors lets them perform a transaction.

For processors in the middle of the flow, they have a queue leading into them... when they execute they take 1 or more flow files off the queue, operate on them, and transfer them to a relationship. If everything worked successfully then session.commit() is called which updates the flow file repository and places these flow files into their next queue. If an error happened then the session is rollbacked and the flow files end up back in the original queue. If NiFi shuts down or crashes while the processor is operating on flow files but before session.commit, then they end up back in the original queue like any other error.

For source or destination processors, a lot of depends on the source or destination system and what protocol is being used to exchange data. NiFi can only provide guarantees that are as good as the protocol provides. For example, in the ListenTCP case, if NiFi crashes at the exact moment that it has read a message off the socket, but before it has written to a flow file, then this message is lost because as far as the TCP connection is concerned it was successful. This is why application protocols like RELP were built on top of TCP to offer two-phase commits so that the receiver can acknowledge that not only did it read it off the socket, but it also performed any additional operations successfully.

View solution in original post

2 REPLIES 2

avatar
Master Guru

I believe your understanding is correct...

The flow file repository is like a transaction log that provides the state of where every flow file is in the flow. The session object provided to processors lets them perform a transaction.

For processors in the middle of the flow, they have a queue leading into them... when they execute they take 1 or more flow files off the queue, operate on them, and transfer them to a relationship. If everything worked successfully then session.commit() is called which updates the flow file repository and places these flow files into their next queue. If an error happened then the session is rollbacked and the flow files end up back in the original queue. If NiFi shuts down or crashes while the processor is operating on flow files but before session.commit, then they end up back in the original queue like any other error.

For source or destination processors, a lot of depends on the source or destination system and what protocol is being used to exchange data. NiFi can only provide guarantees that are as good as the protocol provides. For example, in the ListenTCP case, if NiFi crashes at the exact moment that it has read a message off the socket, but before it has written to a flow file, then this message is lost because as far as the TCP connection is concerned it was successful. This is why application protocols like RELP were built on top of TCP to offer two-phase commits so that the receiver can acknowledge that not only did it read it off the socket, but it also performed any additional operations successfully.

avatar
Expert Contributor

thank you @Bryan Bende, for the explanation and for clarifying.