Created 03-17-2026 12:54 AM
Environment - NiFi 1.23.2, 3-node cluster
Problem
All of a sudden, all the cluster nodes became unavailable.
After restarting the service, the cluster was restored. Can someone please let us know the possible reasons for the disconnect?
Created 03-17-2026 04:42 AM
@Vishesh, Welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts @MattWho and @steven-matison, who may be able to assist you further.
Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur
Created 03-17-2026 06:29 AM
@Vishesh
It is not possible to say exactly what issue you encountered here. Do you still have the complete stack trace that followed the node disconnection exception? The full stack trace likely contains "Caused by:..." lines that may help.
Were any changes being made when the disconnection occurred?
When you restarted your service, what did you observe in the nifi-app.log on all three nodes during startup? A flow election happens first: each matching flow gets a vote, the flow with the most votes becomes the cluster flow, and nodes without that flow join and inherit the cluster flow. One of your three nodes would have been elected cluster coordinator, and the other nodes would have formed the cluster by sending heartbeats to that node. During that node connection phase, any node with a mismatched flow would inherit the cluster flow. Was there any logging about one or more of your nodes inheriting the cluster flow on startup?
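To spot these election and inheritance events after a restart, a keyword search of nifi-app.log on each node is usually enough. Below is a minimal sketch; the sample log lines are invented placeholders purely to make the grep runnable, and the real messages in your logs will be phrased differently depending on your NiFi version.

```shell
# Sketch: keyword search of nifi-app.log for election / coordinator /
# flow-inheritance activity during startup.
# NOTE: the sample lines below are INVENTED placeholders, not real NiFi output;
# run the grep against each node's real logs/nifi-app.log instead.
cat > /tmp/sample-nifi-app.log <<'EOF'
2026-03-17 09:01:02,123 INFO [main] ... Flow election is complete
2026-03-17 09:01:03,456 INFO [main] ... This node elected Cluster Coordinator
2026-03-17 09:01:05,789 INFO [main] ... Node will inherit the cluster flow
EOF
# -n prints line numbers, -i is case-insensitive, -E enables alternation.
grep -niE 'election|cluster coordinator|inherit' /tmp/sample-nifi-app.log
```

Run the same grep on all three nodes and compare timestamps; the node that never logs coordinator activity is the one worth examining first.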
If not that, then it is possible that some component was stuck in a transitional state. When you start a component on the canvas, it goes initially to "starting" and then "started". Likewise, stopping a component transitions through "stopping" to "stopped". You may have been in a situation where your nodes had a component stuck in the "stopping" or "starting" phase, while your cluster coordinator completed the transition. This can be caused by a bug in the component, load on the system, a component with a very long-running or hung process working on a FlowFile with large content, etc. Inspecting thread dumps from those disconnected (but still running) nodes might help identify the scenario.

This may be the most likely cause in your case. I say this because if your flow.json.gz were corrupt, restarting your cluster would have thrown exceptions when trying to load the corrupted flow.json.gz. When stopping the nodes, NiFi eventually times out waiting for threads to complete gracefully and kills them. Then on restart there is no flow.json.gz corruption, all nodes restart fine, the flow loads successfully, and the components are set back to the same running state.
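For the thread-dump route, NiFi's own "bin/nifi.sh dump <file>" writes a thread dump of the running instance; capture two or three a minute apart on each affected node. The snippet below sketches the analysis step. The dump text is an invented placeholder just to make the commands runnable; real dumps come from your nodes, and the thread names shown follow NiFi's "Timer-Driven Process Thread-N" naming.

```shell
# Sketch: scanning a NiFi thread dump for stuck component threads.
# Capture real dumps with:  bin/nifi.sh dump /tmp/nifi-td-1.txt  (repeat a few
# times, a minute apart). The dump below is an INVENTED placeholder.
cat > /tmp/sample-dump.txt <<'EOF'
"Timer-Driven Process Thread-3" ... java.lang.Thread.State: BLOCKED
"Timer-Driven Process Thread-7" ... java.lang.Thread.State: RUNNABLE
EOF
# A thread that shows the same stack and state (especially BLOCKED) across
# several consecutive dumps is a candidate for a hung component.
grep -c 'BLOCKED' /tmp/sample-dump.txt
```

Comparing several dumps side by side matters more than any single dump: a busy thread moves between dumps, a hung one does not.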
While none of the above is a definitive answer (there is not enough information to provide one), hopefully this gives you an idea of what could have happened, so you know what to collect or look at more deeply should it happen again.
I will add that a number of fixes have gone into the newer releases (some around NiFi clustering). Apache NiFi 1.x is officially end of life. If you cannot migrate to the newer Apache NiFi 2.x branch, you should at least upgrade to the latest Apache NiFi 1.28 release to take advantage of the fixes made there since 1.23.
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 03-18-2026 08:59 PM
I have done some analysis of the logs; the node went down at exactly 8:56 AM EST.
First error - Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed marshalling 'HEARTBEAT' protocol message
Second error - Cluster failed processing request: org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node has yet been elected Cluster Coordinator. Cannot establish connection to cluster yet. Returning Service Unavailable response.
Third error - Disconnecting node due to Failed to properly handle Reconnection request due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
Fourth error - Node disconnected due to Failed to properly handle Reconnection request due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
Created on 03-19-2026 05:37 AM - edited 03-19-2026 05:38 AM
@Vishesh
Sorry, but there is not enough information here to give a definitive cause for your issue. There are typically full stack traces accompanying these types of logged exceptions, which can tell us more, though even then it may not be enough information.
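To capture the full stack trace behind each of those errors, grep the exception class with trailing context lines so the "Caused by:" chain comes along. The sample log below is an invented placeholder just to make the command runnable; point the grep at your real nifi-app.log instead.

```shell
# Sketch: pulling an exception plus its "Caused by:" chain out of nifi-app.log.
# The sample log is an INVENTED placeholder; substitute your real log file.
cat > /tmp/sample-app.log <<'EOF'
2026-03-18 08:56:10 ERROR Failed to send heartbeat
org.apache.nifi.cluster.protocol.ProtocolException: Failed marshalling 'HEARTBEAT' protocol message
Caused by: java.io.IOException: (placeholder lower-level cause)
EOF
# -A 20 prints the 20 lines after each match, capturing the cause chain.
grep -nA 20 'ProtocolException' /tmp/sample-app.log
```

The same pattern works for the FlowSynchronizationException entries; the "Caused by:" lines are what Matt asked for above.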
As I mentioned before, there are numerous improvements between Apache NiFi 1.23 and the last Apache NiFi 1.x release, version 1.28. I'd strongly encourage you to upgrade to see if your issue persists.
You may be hitting this bug NIFI-12232 which was addressed in versions 1.26+.
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt