Support Questions

Find answers, ask questions, and share your expertise

Details on NiFi Fault Tolerance

Expert Contributor

How do NiFi fault tolerance and, more importantly, FlowFile acking work?

Specifically, looking at NiFi from a streaming background:

  • How does guaranteed FlowFile processing work? On a single node it's based on a write-ahead log (WAL), but if there's no replication across NiFi nodes, how will other nodes know where to resume processing from?
  • How can failover with guaranteed processing be implemented using existing Processors?

This post covers it at a high level but doesn't talk about the specifics of how failover will work. https://community.hortonworks.com/questions/46887/is-nifi-fault-tolerant-against-machine-failures.ht...

1 ACCEPTED SOLUTION

Master Mentor

@ambud.sharma

Each node in a NiFi cluster runs its own copy of the dataflow and works on its own set of FlowFiles. Node A, for example, is unaware of the existence of Node B.

NiFi does persist all FlowFiles (attributes and content) into local repositories on each node in the cluster. That is why it is important to make these repos fault tolerant (for example, using RAID 10 disks for your repos).

Should a node go down, as long as you have access to those repos and a copy of the flow.xml.gz, you can recover your dataflow where it left off, even if that means spinning up a new NiFi and pointing it at those existing repos.
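
As a rough sketch of that recovery, assuming the failed node's repository disks are mounted on the replacement host (the paths below are illustrative, not required values), the replacement NiFi's conf/nifi.properties would point at the existing flow and repositories:

    # conf/nifi.properties on the replacement node (illustrative paths)
    nifi.flow.configuration.file=/mnt/failed-node/conf/flow.xml.gz
    nifi.flowfile.repository.directory=/mnt/failed-node/flowfile_repository
    nifi.content.repository.directory.default=/mnt/failed-node/content_repository
    nifi.provenance.repository.directory.default=/mnt/failed-node/provenance_repository

On startup, NiFi replays the write-ahead log in the FlowFile repository, so queued FlowFiles pick up from where the failed node left off.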

NiFi comes with no automated built-in process for this.

While nodes at this time are not aware of other nodes or the data they currently have queued, this is a roadmap item for a future version of NiFi. To the best of my knowledge, the HA data plane work has not yet been committed to any particular release.

Thanks,

Matt


7 REPLIES

Rising Star

Hi Ambud,

This section in the Storm documentation provides an excellent explanation of how acking and fault tolerance work together to guarantee at-least-once processing:

http://storm.apache.org/releases/1.0.2/Guaranteeing-message-processing.html

Rising Star

Sorry, I misread your question as being about Storm fault tolerance, not NiFi.

Expert Contributor

@Ryan Templeton thanks for your answer, Ryan; the question is how it works in NiFi, not Storm.


Expert Contributor

So the repos need to be on shared storage (like NFS) between NiFi nodes?

Rising Star

FlowFile content is written to the content repository. When a node goes down, the NiFi Cluster Manager (NCM) will route new data to another node. However, NiFi does not replicate data the way Kafka does. The data queued on the failed node remains queued on that node; it must either be manually sent over to a live node in the cluster, or the failed node must be brought back up. Any new data will automatically be routed to the other nodes in the cluster by the NCM.

http://alvincjin.blogspot.com/2016/08/fault-tolerant-in-apache-nifi-07-cluster.html
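
One way to do that manual recovery, as a minimal sketch (hostnames and paths here are hypothetical), is to stand up a replacement node on the failed node's repositories:

    # On a fresh replacement host, with the failed node's disks mounted at /mnt/failed-node
    rsync -a /mnt/failed-node/flowfile_repository/ /opt/nifi/flowfile_repository/
    rsync -a /mnt/failed-node/content_repository/  /opt/nifi/content_repository/
    rsync -a /mnt/failed-node/conf/flow.xml.gz     /opt/nifi/conf/flow.xml.gz
    # (The provenance repository can be copied too, but it holds history,
    #  not the queued data, so it isn't required to recover the queues.)
    # Start NiFi; it restores queued FlowFiles from the copied repositories
    /opt/nifi/bin/nifi.sh start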

Master Mentor

Only the NiFi 0.x or HDF 1.x versions of NiFi use an NCM. NiFi 1.x and HDF 2.x versions have moved to zero-master clustering and do not have an NCM anymore (the control plane is now highly available).

The routing of data you are referring to is specific to data being sent to your NiFi cluster via Site-to-Site (S2S). S2S does make sure that data continues to route only to the available destination nodes.

Matt