Created 04-25-2025 02:53 PM
Hi
We are currently operating a three-node cluster using Apache NiFi version 2.1.
During a recent deployment of NiFi flows, we encountered an issue related to replication inconsistencies across the nodes.
This resulted in one of the nodes dropping out of the cluster and remaining in a disconnected state.
Furthermore, the affected node did not attempt to reconnect to the cluster on its own.
Although the node was disconnected from the cluster, it continued to operate independently, allowing it to consume messages, execute database queries, and perform actions as a standalone instance.
I would like to know two things here
Thanks in advance for your input. Let me know if more details are required.
Thanks
Created 05-02-2025 07:25 AM
@shiva239
There are numerous things happening when a node is disconnected.
A node that is disconnected is still part of the cluster until it is dropped. Once dropped, the cluster no longer considers it a member. This distinction matters for load-balanced connections that use a load balance strategy other than Round Robin.
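To illustrate why that distinction matters, here is a toy model in Python. It is only a sketch under stated assumptions: the `Node` class, the SHA-256 modulo hash, and the function names are hypothetical and are not NiFi's actual implementation. It assumes that an attribute-based strategy keeps mapping a given attribute value to the same cluster member, even if that member is disconnected, while Round Robin only hands data to nodes that are currently connected.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    connected: bool = True


def partition_target(value: str, members: List[Node]) -> Node:
    """Pick a member from the attribute value (toy hash, not NiFi's own)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return members[int.from_bytes(digest[:8], "big") % len(members)]


def route_by_attribute(value: str, members: List[Node]) -> Optional[str]:
    """Return the destination node name, or None if the FlowFile must wait."""
    target = partition_target(value, members)
    # A disconnected node is still a member, so there is no failover here:
    # the FlowFile stays queued until the node reconnects or is dropped.
    return target.name if target.connected else None


def route_round_robin(counter: int, members: List[Node]) -> Optional[str]:
    """Round Robin only considers nodes that are currently connected."""
    available = [n for n in members if n.connected]
    return available[counter % len(available)].name if available else None


# node3 is disconnected but has not been dropped, so it is still a member.
cluster = [Node("node1"), Node("node2"), Node("node3", connected=False)]
keys = [f"customer-{i}" for i in range(12)]

stuck = [k for k in keys if route_by_attribute(k, cluster) is None]
print("queued, waiting for node3:", stuck)

# Dropping node3 removes it from the membership list, so every key now maps
# to a node that can actually receive data.
after_drop = [n for n in cluster if n.connected]
print("after drop:", {k: route_by_attribute(k, after_drop) for k in stuck})
```

In this model the keys that hash to `node3` simply wait in the queue, which is consistent with FlowFiles sitting in a "partition by attribute" connection while a node is disconnected but not yet dropped.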
NiFi also offers an offloading feature. This allows a user with proper authorization to offload a disconnected node. (IMPORTANT: only a reachable and running node can be offloaded successfully. Attempting to offload a down or unreachable node will not work.) Once a node is disconnected, a user can choose to offload it; this is typical when, say, a user wants to decommission a node in the cluster. Initiating an offload sends a request to that disconnected node to stop and terminate all running components, and then offload its queued FlowFiles to the other nodes connected to the cluster. If cluster nodes were allowed to continue load balancing to disconnected node(s), this capability would fail, as you would end up with a constant loop of FlowFiles back to the disconnected node. Once offloading completes, the disconnected node can be dropped, and the FlowFiles that were offloaded get load balanced across the nodes that remain members of the cluster.
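The disconnect, offload, and drop sequence described above can also be driven through the REST API instead of the UI. The sketch below only builds the requests and sends nothing. The endpoint path, the `status` values, and the body shape reflect my understanding of the NiFi REST API and should be verified against the REST API documentation for your version; the `ALLOWED` table is a simplified assumption, and `plan_request` and the node id are hypothetical names used for illustration.

```python
from typing import Dict, Optional, Tuple

# Simplified view of which actions apply in which node state (assumption).
ALLOWED = {
    "CONNECTED": {"disconnect"},
    "DISCONNECTED": {"connect", "offload", "drop"},
    "OFFLOADED": {"connect", "drop"},
}

# Status value placed in the request body for each non-destructive action.
REQUESTED_STATUS = {
    "connect": "CONNECTING",
    "disconnect": "DISCONNECTING",
    "offload": "OFFLOADING",
}

Request = Tuple[str, str, Optional[Dict]]


def plan_request(node_id: str, current_status: str, action: str) -> Request:
    """Return (HTTP method, path, JSON body) for a cluster node action."""
    if action not in ALLOWED.get(current_status, set()):
        raise ValueError(f"cannot {action} a node in state {current_status}")
    # Assumed endpoint; check it against your NiFi version's REST API docs.
    path = f"/nifi-api/controller/cluster/nodes/{node_id}"
    if action == "drop":
        # Dropping removes the node from the cluster's membership.
        return ("DELETE", path, None)
    body = {"node": {"nodeId": node_id, "status": REQUESTED_STATUS[action]}}
    return ("PUT", path, body)


# Decommissioning a disconnected node: offload first, then drop.
node_id = "example-node-id"  # hypothetical identifier
print(plan_request(node_id, "DISCONNECTED", "offload"))
print(plan_request(node_id, "OFFLOADED", "drop"))
```

The state check mirrors the point made above: offloading is only offered for a node that is already disconnected, and it still requires that node to be running and reachable, which this sketch does not verify.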
I think I covered all the basic behind-the-scenes functionality of load-balanced connections with regard to disconnected node behavior.
In your scenario, the node became disconnected due to some issue when changing the version of a version-controlled process group. I would recommend posting a new community question if you need help with that issue, as it has no direct relationship to how load-balanced connections function or to the behavior of disconnected nodes that keep running, as discussed here.
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 04-28-2025 05:41 AM
I find this comment interesting:
"Furthermore, the affected node did not attempt to reconnect to the cluster on its own."
Thank you,
Matt
Created 04-30-2025 10:27 AM
Thank you @MattWho for providing the details! I now have a much clearer understanding of what was happening.
I don't have detailed logs for the last occurrence, but I'll share them if we encounter the issue again. One thing is certain: the node was disconnected due to a discrepancy in the flow.json file. This happened during the deployment of NiFi flows, specifically while upgrading the version of an existing flow.
If it had been a standalone execution of the disconnected node, there wouldn't have been an issue, I believe. The actual problem arose because some flow files were stuck in one of the flows, at a connection where load balancing is enabled with the "partition by attribute" strategy (as shown in the image). I assume those records were waiting to be transferred to the other two nodes, which were not accessible to the disconnected node.
Could you explain this situation further for my better understanding? That will help us take the right steps. Please let me know if more details are required.