Failover mechanism in NiFi

Rising Star

How do you achieve processor failover, process group failover, and node-level failover in NiFi?

1 ACCEPTED SOLUTION

Super Mentor

@spdvnz

NiFi Processors:

NiFi processor components that are likely to encounter failures have a "failure" routing relationship. Oftentimes, failure is handled by looping that failure relationship back onto the same processor so that the operation against the failed FlowFile is re-attempted after the FlowFile penalty duration has expired (default 30 seconds). However, you may also wish to route failure through additional processors. For example, suppose the failure occurs in a PutSFTP processor configured to send data to system "abc". Instead of looping the failure relationship, you could route failure to a second PutSFTP processor configured to send to an alternate destination server, "xyz". Failure from the second PutSFTP could then be routed back to the first PutSFTP. In this scenario, a complete failure only occurs when delivery to both systems fails.
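To make the routing pattern concrete, here is a minimal Python sketch (not NiFi code) modeling retry-after-penalty plus fallback to an alternate destination. The destination names, the send_via() helper, and the simulated availability table are hypothetical; only the 30-second penalty default mirrors NiFi.

```python
import time

PENALTY_SECONDS = 30  # mirrors NiFi's default FlowFile penalty duration

# Hypothetical availability of the two destination servers from the example.
AVAILABLE = {"abc": False, "xyz": True}

def send_via(destination: str, flowfile: bytes) -> bool:
    """Hypothetical stand-in for a PutSFTP transfer; True means delivered."""
    return AVAILABLE.get(destination, False)

def deliver(flowfile: bytes, max_attempts: int = 3) -> bool:
    """Try the primary server, then the alternate; retry after a penalty."""
    for attempt in range(max_attempts):
        for dest in ("abc", "xyz"):          # primary first, then alternate
            if send_via(dest, flowfile):
                return True                  # routed to "success"
        time.sleep(PENALTY_SECONDS)          # penalized before the next attempt
    return False                             # complete failure: both systems failed

print(deliver(b"example FlowFile content"))
```

The point of the model is that a "complete failure" is only declared after both destinations have been tried, which is exactly what looping the second processor's failure back to the first achieves in the flow.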

NiFi Process Groups:

I am not sure what failover condition at the process group level you are trying to account for here. Process groups are nothing more than a logical container for individual dataflows. Failover would still be handled through dataflow design.

NiFi Node level failover:

In a NiFi cluster, there is always an elected "cluster coordinator" and an elected "primary node". The primary node runs all processors that are configured to execute on the primary node only (set on their Scheduling tab). Should the cluster coordinator stop receiving heartbeats from the current primary node, a new node will be designated as the primary node and will start those "on primary node" processors.

If the "cluster coordinator" is lost, a new cluster coordinator will be elected and will assume the role of receiving heartbeats from other nodes.

A node that has become disconnected from the cluster will continue to run its dataflows as long as NiFi is still running.

NiFi FlowFile failover between nodes:

Each node in a NiFi cluster is responsible for all of the FlowFiles it is currently working on. A node has no knowledge of what FlowFiles are currently queued on any other node in the cluster. If a NiFi node goes completely down, the FlowFiles that it had queued at the time of failure will remain in its repositories until that NiFi is brought back online.

The content and FlowFile repositories are not locked to a specific NiFi instance. While you cannot merge these repositories with the existing repositories of another node, it is possible to stand up an entirely new NiFi node and have it use the repositories from the down node to pick up operation where it left off. It is therefore important to protect the FlowFile and content repositories via RAID so that a disk failure does not result in data loss.
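A hedged sketch of the "stand up a new node on the old repositories" idea: the script below rewrites the repository paths in nifi.properties on a replacement node to point at directories salvaged from the failed node. The install path and salvage paths are assumptions for illustration; the property keys shown are the standard FlowFile and content repository settings.

```python
from pathlib import Path

NIFI_PROPERTIES = Path("/opt/nifi/conf/nifi.properties")  # assumed install location
SALVAGED = {
    "nifi.flowfile.repository.directory": "/mnt/salvage/flowfile_repository",
    "nifi.content.repository.directory.default": "/mnt/salvage/content_repository",
}

# Rewrite only the repository-path properties, leaving every other line untouched.
updated = []
for line in NIFI_PROPERTIES.read_text().splitlines():
    key = line.split("=", 1)[0].strip()
    updated.append(f"{key}={SALVAGED[key]}" if key in SALVAGED else line)

NIFI_PROPERTIES.write_text("\n".join(updated) + "\n")
print("Repository paths updated; start the replacement node to resume the queued FlowFiles.")
```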

Data HA across NiFi nodes is a future roadmap item.

Thanks,

Matt
