Support Questions

Find answers, ask questions, and share your expertise

Suddenly all the NiFi nodes disconnected

Visitor

Environment - NiFi 1.23.2, 3-node cluster

Problem

All of a sudden, all the cluster nodes become unavailable:

Error: Received disconnection request message from cluster coordinator with explanation: org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
Disconnecting node due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
Event Reported for atl-fm-nifi01.corp.pps.io:8443 -- Node disconnected due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.

After restarting the service, the cluster recovered. Can someone please let us know the possible reasons for the disconnect?


Community Manager

@Vishesh, welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts @MattWho and @steven-matison, who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Master Mentor

@Vishesh 

It is not possible to say exactly what issue you encountered here.  Do you still have the complete stack trace that followed the node disconnection exception?  The full stack trace is likely to have some "Caused by: ..." lines that may help.
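If the logs are still around, the full chain is easy to pull out with grep. A minimal sketch (the log path, class names, and sample lines below are illustrative, not taken from your cluster; point the commands at your real nifi-app.log):

```shell
# Create a tiny sample log so the commands below are runnable as-is.
# On a real system, LOG would be something like $NIFI_HOME/logs/nifi-app.log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-01-01 08:56:01,000 ERROR [main] Disconnecting node due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated.
	at org.apache.nifi.example.Illustrative.method(Illustrative.java:1)
Caused by: java.lang.IllegalStateException: example root cause
EOF

# Print each matching exception plus the lines that follow it,
# which usually include the "Caused by:" chain identifying the root cause.
grep -A 20 'FlowSynchronizationException' "$LOG"

# Or pull out only the root-cause lines:
grep 'Caused by:' "$LOG"
```

The "Caused by:" lines at the bottom of a chained stack trace are generally the most informative part.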

Were any changes being made when the disconnection occurred?

  1. Stopping/starting a dataflow or process group?
  2. Adding a template to the canvas?  (If you are still using templates: they were deprecated in NiFi 1.x, flow definitions are the new method to use, and templates were officially removed in NiFi 2.x.)  I have seen corruption issues related to templates previously.

When you restarted your service, what did you observe in the nifi-app.log on all three nodes during startup?  A flow election happens first, where like flows each get a vote; the flow with the most votes becomes the cluster flow, and nodes without that flow will join and inherit the cluster flow. One of your three nodes would have been elected cluster coordinator, and all other nodes would have formed the cluster by sending heartbeats to that node. During that node connection phase, any node with a mismatched flow would inherit the cluster flow.  Was there any logging related to one or more of your nodes inheriting the cluster flow on startup?
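A quick way to check for that is to grep each node's startup log for election and inheritance activity. The sample lines below are illustrative (the exact message wording and logger names vary by NiFi version), so treat the patterns as a starting point rather than exact matches:

```shell
# Illustrative sample of startup-time lines; on a real node, grep
# $NIFI_HOME/logs/nifi-app.log instead of this temp file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-01-01 09:01:10,000 INFO [main] FlowElection Vote cast for flow
2024-01-01 09:01:12,000 INFO [main] LeaderElectionManager This node elected for role 'Cluster Coordinator'
2024-01-01 09:01:15,000 INFO [main] FlowService Local flow did not match cluster flow; inheriting cluster flow
EOF

# Surface election, coordinator, and flow-inheritance activity:
grep -Ei 'election|cluster coordinator|inherit' "$LOG"
```

If one node logs that it inherited the cluster flow, that node's local flow had diverged before the restart, which lines up with the "partially updated" exception.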

If not that, it is possible that some component was stuck in a transitional state.  When you start a component on the canvas, it goes initially to "starting" and then "started"; likewise, stopping a component transitions through "stopping" to "stopped".  You may have been in a situation where your nodes had a component stuck in the "stopping" or "starting" phase while your cluster coordinator completed the transition.  This could be caused by a bug in the component, load on the system, a component with a very long-running or hung process working on a FlowFile with large content, etc.  Inspecting thread dumps from the disconnected but still running nodes might help identify the scenario.

This might be the most likely cause in your case.  I say this because if your flow.json.gz were corrupt, restarting your cluster would have produced exceptions when trying to load the corrupted flow.json.gz.  When stopping the nodes, NiFi eventually times out waiting for running threads to complete gracefully and kills them.  Then on restart there is no flow.json.gz corruption, all nodes start fine, the flow loads successfully, and the components are set back to the same running state.
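For the thread-dump route: NiFi ships a dump command in its launch script, and once you have a dump, counting thread states is a fast way to spot stuck work. A minimal sketch (the `nifi.sh dump` invocation is commented out since it needs a running node; the sample dump content below is illustrative):

```shell
# On each node that appears hung, capture a thread dump
# (assumes a typical tarball install; adjust NIFI_HOME for your environment):
#   $NIFI_HOME/bin/nifi.sh dump /tmp/nifi-thread-dump-$(hostname).txt

# Tiny illustrative sample so the inspection step is runnable as-is:
DUMP=$(mktemp)
cat > "$DUMP" <<'EOF'
"Timer-Driven Process Thread-1" ... java.lang.Thread.State: BLOCKED
"Timer-Driven Process Thread-2" ... java.lang.Thread.State: RUNNABLE
"NiFi Web Server-42" ... java.lang.Thread.State: WAITING
EOF

# Count threads by state; a pile-up of BLOCKED threads, or the same
# stack appearing across several consecutive dumps, points at stuck work.
grep -o 'java.lang.Thread.State: [A-Z_]*' "$DUMP" | sort | uniq -c | sort -rn
```

Taking two or three dumps a minute apart and comparing them is more telling than a single dump, since a genuinely hung thread will show the identical stack each time.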


While none of the above is a definitive answer, because there is not enough info to provide one, hopefully this gives you an idea of what could have happened so you know what to collect or look at more deeply should it happen again.

I will add that a number of fixes have gone into the newer releases (some around NiFi clustering).  Apache NiFi 1.x is officially end of life.  If you cannot migrate to the Apache NiFi 2.x branch, you should at least upgrade to the latest Apache NiFi 1.28 release to take advantage of the fixes made there since 1.23.


Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Visitor

I did some analysis of the logs; the node went down at exactly 8:56 AM EST.

First error - Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed marshalling 'HEARTBEAT' protocol message

Second error - Cluster failed processing request: org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node has yet been elected Cluster Coordinator. Cannot establish connection to cluster yet.. Returning Service Unavailable response.

Third error - Disconnecting node due to Failed to properly handle Reconnection request due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.

Fourth error - Node disconnected due to Failed to properly handle Reconnection request due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
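The sequence above is easier to confirm by merging the ERROR lines from every node's nifi-app.log into one time-ordered view. A minimal sketch (the directory layout and the abbreviated log lines are illustrative; copy each node's real log in as `node<N>.log`):

```shell
# Hypothetical layout: one nifi-app.log copied from each node.
mkdir -p /tmp/nifi-logs
cat > /tmp/nifi-logs/node1.log <<'EOF'
2024-01-01 08:56:02,000 ERROR Failed to send heartbeat due to: ProtocolException: Failed marshalling 'HEARTBEAT' protocol message
EOF
cat > /tmp/nifi-logs/node2.log <<'EOF'
2024-01-01 08:56:05,000 ERROR Cluster failed processing request: NoClusterCoordinatorException: No node has yet been elected Cluster Coordinator
EOF

# Tag each ERROR line with its source file, then sort by date and time,
# so the heartbeat-failure -> no-coordinator -> disconnect order is visible:
grep -H 'ERROR' /tmp/nifi-logs/node*.log | sed 's/:/ /' | sort -k2,2 -k3,3
```

Seeing which node reported the heartbeat marshalling failure first, relative to the coordinator election messages, helps establish whether the coordinator or a member node failed first.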

Master Mentor

@Vishesh 

Sorry, but there is not enough information to give a definitive cause for your issue.  There are typically full stack traces to go with these types of logged exceptions, which can tell us more, though even then it may still not be enough.

As I mentioned before, there are numerous improvements between Apache NiFi 1.23 and the last Apache NiFi 1.x release, 1.28.  I'd strongly encourage you to upgrade and see if your issue persists.

You may be hitting NIFI-12232, which was addressed in versions 1.26 and later.


Thank you,
Matt