Support Questions

Find answers, ask questions, and share your expertise

NiFi 2.1: NiFi node continued processing FlowFiles in DISCONNECTED state and out of the cluster. Is that expected?

Rising Star

Hi

We are currently operating a three-node cluster using Apache NiFi version 2.1.
During a recent deployment of NiFi flows, we encountered an issue related to replication inconsistencies across the nodes.
This problem resulted in one of the nodes falling out of the cluster and remaining in a disconnected state.
Furthermore, the affected node did not attempt to reconnect to the cluster on its own.

Although the node was disconnected from the cluster, it continued to operate independently, allowing it to consume messages, execute database queries, and perform actions as a standalone instance.

I would like to know two things:

  1. Is this expected behavior for a NiFi node to remain active even though it is not in the cluster?
  2. Can we alter this behavior through configuration to stop processors temporarily while a node is not connected to the cluster?

Thanks in advance for your inputs. Let me know if more details are required.

Thanks

1 ACCEPTED SOLUTION

Master Mentor

@shiva239 

There are numerous things happening when a node is disconnected.

  1. A disconnected node is different from a dropped node.  A cluster node must be disconnected before it can be dropped.
  2. A node can become disconnected in two ways:
    1. Manually disconnected - A user manually disconnects a node via the NiFi Cluster UI.  A manually disconnected node will not attempt to auto-rejoin the cluster. A user can manually reconnect the node from another node in the cluster via the same Cluster UI.
    2. A node becomes disconnected due to some issue.  

A node that is disconnected is still part of the cluster until it is dropped.  Once dropped, the cluster no longer considers it part of the cluster.  This distinction matters when it comes to load-balanced connections that use a load-balance strategy other than Round Robin.

  • Load-balanced connections use the NiFi Site-to-Site protocol to move FlowFiles between nodes.  Only connected nodes are eligible to have FlowFiles sent to them over Site-to-Site.
  • Even a disconnected node is still able to load-balance FlowFiles to the other nodes still connected in the cluster.  So when your one node disconnected from the cluster, if you went to that node's UI directly, the load-balanced connection would appear to be processing all FlowFiles normally.  This is because the two nodes to which it sends some FlowFiles by attribute are still connected, and thus it is allowed to send to them.  The FlowFiles whose attribute maps them to the disconnected node never leave that node and are processed locally.  On the still-connected nodes the story is different: they can only send to connected nodes, so any FlowFiles destined for the disconnected node begin to queue.  Even if you stopped the dataflows on the disconnected node, the FlowFiles would continue to queue for that node, so stopping the dataflow on a node that disconnects would still present the same issue.
  • A disconnected node is still aware of which nodes are part of the cluster and can still communicate with ZooKeeper to know which node is the elected cluster coordinator.  Let's say a second node disconnects: the first disconnected node would stop attempting to send to that now-disconnected node and would queue FlowFiles destined for it.
  • Only the Round Robin strategy will attempt to redistribute FlowFiles to the remaining connected nodes when a node becomes disconnected.  The Partition by Attribute and Single Node strategies are used when it is important that "like" FlowFiles end up on the same node for downstream processing.  So once a like FlowFile (in your case, a FlowFile with a given value in the orderid FlowFile attribute) is marked for node 3, all FlowFiles with that same orderid will queue for node 3 as long as node 3 is still a member of the cluster.  A disconnected node is still part of the cluster and will already have some "like" FlowFiles on it, so we would not want NiFi to suddenly start sending "like" data to some other node.  If manual user action were taken to drop the disconnected node, then the load-balanced connections would start using a different node for the FlowFiles originally allocated to the disconnected node.
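To make the membership point concrete, here is a minimal sketch (not NiFi's actual implementation; the hashing scheme and node names are made up for illustration) of why Partition by Attribute keeps queueing FlowFiles for a disconnected node: the routing decision is computed over cluster *membership*, and a disconnected node is still a member until it is dropped.

```python
# Illustrative sketch of attribute-based partitioning over cluster membership.
# A DISCONNECTED node stays in the membership list; only a DROPPED node is
# removed, which is the moment its "like" FlowFiles get remapped.
import zlib

def route(attribute_value: str, members: list[str]) -> str:
    """Pick a member node for a FlowFile attribute value.

    The assignment is deterministic: the same attribute value always maps
    to the same node while the membership list is unchanged.
    """
    bucket = zlib.crc32(attribute_value.encode()) % len(members)
    return sorted(members)[bucket]

members = ["node1", "node2", "node3"]  # node3 is DISCONNECTED, not dropped

# Same orderid -> same node while membership is stable, so FlowFiles
# assigned to node3 queue rather than being redirected.
assert route("orderid-1001", members) == route("orderid-1001", members)

# Only after node3 is dropped does the membership list shrink and its
# attribute values get reassigned to the remaining nodes.
remaining = [n for n in members if n != "node3"]
assert route("orderid-1001", remaining) in remaining
```

This is why stopping the dataflow on the disconnected node doesn't help: the connected nodes compute the same stable assignment and keep queueing for the absent member.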

NiFi also offers an off-loading feature. This allows a user with proper authorization to off-load a disconnected node (IMPORTANT: only a reachable and running node can be offloaded successfully; attempting to offload a down or unreachable node will not work).  Once a node is disconnected, a user can choose to offload it; this is typical if, say, a user wants to decommission a node in the cluster.  Initiating an off-load sends a request to that disconnected node to stop and terminate all running components and then off-load its queued FlowFiles to the other nodes connected to the cluster.  If cluster nodes were allowed to continue to load-balance to disconnected node(s), this capability would fail, as you would end up with a constant loop of FlowFiles back to the disconnected node.  Once offloading completes, the disconnected node can be dropped, and the FlowFiles that were offloaded get load-balanced to the remaining nodes still in the cluster.
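Offloading can also be driven programmatically. The sketch below builds (but does not send) the REST call; the endpoint path and the `OFFLOADING` status value are taken from the NiFi cluster REST API as I recall it, and the hostname and node id are hypothetical, so verify against the REST API docs for your NiFi version before relying on this.

```python
# Sketch of building the NiFi REST call that requests an offload of a
# disconnected node. Nothing is sent here -- doing so requires a live,
# authenticated cluster. BASE and the node id are placeholders.
import json
import urllib.request

BASE = "https://nifi.example.com:8443/nifi-api"  # hypothetical host

def offload_request(node_id: str) -> urllib.request.Request:
    """Build the PUT that asks the cluster to offload a node.

    Assumed endpoint (check your version's REST API docs):
      PUT /controller/cluster/nodes/{id}  with status "OFFLOADING"
    """
    body = json.dumps({"node": {"nodeId": node_id, "status": "OFFLOADING"}})
    return urllib.request.Request(
        f"{BASE}/controller/cluster/nodes/{node_id}",
        data=body.encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Typical sequence (commented out -- requires a reachable cluster):
#   1. GET  {BASE}/controller/cluster           -> find the node's id/status
#   2. PUT  {BASE}/controller/cluster/nodes/X   -> status OFFLOADING (above)
#   3. DELETE the node to drop it, or PUT status CONNECTING to rejoin
req = offload_request("aabbccdd-0011")
```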

I think that covers all the basic behind-the-scenes functionality of load-balanced connections with regard to disconnected-node behavior.

In your scenario, the node became disconnected due to some issue while changing the version of a version-controlled process group.  I would recommend opening a new community question if you need help with that issue, as it has no direct relationship with how load-balanced connections function or with disconnected nodes continuing to run, as discussed here.

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt


3 REPLIES

Master Mentor

@shiva239 

  1. Is this expected behavior for a NiFi node to be active although it is not in the cluster?
    1. Yes: A disconnected node that is still running will continue to run its enabled and running NiFi components, processing the existing FlowFiles on that specific node and ingesting new data as well.  The node will still be aware that it belongs to a cluster, so components will still use ZooKeeper for any cluster-stored state data (read and update).  It is simply no longer connected, but all functionality persists.  What it cannot do while disconnected is receive any configuration changes that the elected cluster coordinator replicates to all nodes currently part of the cluster.
    2. From a node in the cluster, you should be able to go to the Cluster UI and look at the node that is marked as disconnected to see the recorded reason for the disconnection (such as lack of heartbeat).  A node that disconnects not as the result of manual user action should automatically attempt to reconnect, as it will still attempt to send heartbeats to the elected cluster coordinator reported by ZooKeeper.  When the cluster coordinator receives one of these heartbeats from the disconnected node, it will initiate a node reconnection.  During this reconnection, the node's dataflow (flow.json) is compared with the cluster's current dataflow.  In order for the node to rejoin, its local flow must match the cluster flow.  If it does not, the node will attempt to inherit the cluster flow.  If inheritance of the cluster flow is not possible, this will be logged with the reason (one common reason is that the cluster flow no longer has a connection that the local flow still has and which contains FlowFiles; NiFi will not inherit a flow that would result in data loss on the local node).

  2. Can we alter the behavior through configuration to stop processors temporarily while a node is not connected to the cluster?
    1. NiFi has no option to stop processors on a disconnected node.  I am not clear on the use case for why you would want to do this.  The expectation is that a node that disconnects unexpectedly (commonly due to lack of heartbeat) will auto-reconnect once heartbeats to the cluster coordinator resume.  A disconnection also does not mean loss of functionality on the disconnected node: it can still execute its dataflow just as it did while connected.  While all nodes in the cluster keep their dataflows in sync and use ZooKeeper for any cluster-state sharing, each node executes based on its local copy of flow.json and processes its own node-specific set of FlowFiles.  This continues even when a node is disconnected, because that node still knows it was part of a cluster.
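The heartbeat-driven disconnect/reconnect cycle described above can be sketched as a toy model. This is purely illustrative, not NiFi's code: the miss threshold is made up, and NiFi's real timing comes from `nifi.properties` (e.g. `nifi.cluster.protocol.heartbeat.interval`).

```python
# Toy model of the coordinator behavior: a node is marked DISCONNECTED
# after repeatedly missed heartbeats, and a later heartbeat from that node
# triggers a reconnection attempt. Thresholds are hypothetical.

class Coordinator:
    MISS_LIMIT = 3  # hypothetical: disconnect after 3 missed heartbeats

    def __init__(self, nodes):
        self.status = {n: "CONNECTED" for n in nodes}
        self.missed = {n: 0 for n in nodes}

    def tick(self, heartbeats: set):
        """One heartbeat interval; `heartbeats` = nodes we heard from."""
        for node in self.status:
            if node in heartbeats:
                if self.status[node] == "DISCONNECTED":
                    # Heartbeat received from a disconnected node:
                    # the coordinator initiates reconnection.
                    self.status[node] = "CONNECTING"
                self.missed[node] = 0
            else:
                self.missed[node] += 1
                if self.missed[node] >= self.MISS_LIMIT:
                    self.status[node] = "DISCONNECTED"

coord = Coordinator(["node1", "node2", "node3"])
for _ in range(3):                       # node3 goes silent for 3 intervals
    coord.tick({"node1", "node2"})
assert coord.status["node3"] == "DISCONNECTED"

coord.tick({"node1", "node2", "node3"})  # node3 resumes heartbeats
assert coord.status["node3"] == "CONNECTING"
```

This is also why a node that never resumes heartbeats (or cannot reach ZooKeeper to find the coordinator) stays disconnected, which is what the diagnostic questions below are probing.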

I find this comment interesting:
"Furthermore, the affected node did not attempt to reconnect to the cluster on its own."

  • Did you check the reason recorded for why this node disconnected?  (Did a user manually disconnect the node, or was it disconnected for another reason?)
  • Did you inspect the logs on the disconnected node and on the elected cluster coordinator around the time of the disconnection?
  • Do you see the disconnected node logging any issue communicating with ZooKeeper?
  • Do you see the disconnected node attempting to send heartbeats to the currently elected cluster coordinator?
  • Is the current cluster coordinator logging receipt of these heartbeats?

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Rising Star

Thank you @MattWho  for providing the details! I now have a much clearer understanding of what was happening.

I don't have detailed logs for the last occurrence, but I'll share them if we encounter the issue again. One thing is certain: the node was disconnected due to a discrepancy in the flow.json file. This happened during the deployment of NiFi flows, specifically while upgrading the version of an existing flow.

If it had been a standalone execution on the disconnected node, there wouldn't have been an issue, I believe. The actual problem arose because some FlowFiles were stuck in one of the flows, at a connection where load balancing is enabled with the "Partition by attribute" strategy (as shown in the image). I assume those records were waiting to be transferred to the other two nodes, which were not accessible to the disconnected node.

Could you explain this situation further for my better understanding? That will help us take the right steps. Please let me know if more details are required.

 

shiva239_0-1746033912952.png