Created on 07-07-2025 03:18 AM - last edited on 07-07-2025 10:07 PM by VidyaSargur
Hi
I have a 3-node Apache NiFi cluster, which is managed by a 3-node ZooKeeper cluster.
The dev cluster worked fine, except that one node frequently dropped off and we sometimes had to restart it manually after renaming its flow.xml.gz and flow.json.gz, after which the node started up fine and connected to the cluster.
But today, after one node went down, it wouldn't connect back to the cluster (even after renaming the flow gz files). Within a few minutes another node disconnected from the cluster, and the last node, which was the primary at that stage, threw a socket timeout, so I manually restarted it, and it won't start up, throwing
Created 07-07-2025 03:59 AM
@MK77, Welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts, @MattWho, @SAMSAL, and @Shelton, who may be able to assist you further.
Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur,
Created 07-07-2025 05:54 AM
@MK77
First, let's clarify the ZooKeeper (ZK) elected roles in Apache NiFi.
Primary:
Cluster Coordinator:
Any node in the NiFi cluster can be assigned either or both of these roles. There is no guarantee that the same node(s) will always be assigned these roles. Even after the NiFi cluster is formed and the roles are assigned, the nodes holding these roles can change.
The flow.json.gz contains the dataflows on the canvas that are loaded on startup. The flow.xml.gz is only loaded if the flow.json.gz is missing. If NiFi loads the dataflow from the flow.xml.gz, it will generate a flow.json.gz from that flow.xml.gz.
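The load order described above can be sketched as a small check to run against a node's configuration directory. This is only an illustration of the rule as stated here, not NiFi's own code; the function name is hypothetical, and it assumes both flow files live in the same directory (use whatever locations your nifi.properties actually points to).

```python
from pathlib import Path
from typing import Optional


def flow_file_loaded_at_startup(conf_dir: Path) -> Optional[Path]:
    """Return the flow file that would be loaded from conf_dir, per the rule above.

    Hypothetical helper: assumes flow.json.gz and flow.xml.gz sit side by side.
    """
    flow_json = conf_dir / "flow.json.gz"
    flow_xml = conf_dir / "flow.xml.gz"
    if flow_json.exists():
        # flow.json.gz takes precedence whenever it is present.
        return flow_json
    if flow_xml.exists():
        # Fallback: the XML flow is loaded and a flow.json.gz is generated from it.
        return flow_xml
    # Neither file is present.
    return None
```

One consequence for the rename workaround: if only flow.json.gz is renamed while a flow.xml.gz is still present, the node would load the XML flow on the next start rather than starting without a local flow.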
Now on to your problem....
Neither of the log lines you shared points to any problem:
Invalid State Cannot replicate request to Node <node-hostname:port> because the node is not connected
This log line simply tells you that this node cannot yet replicate a request to another node because that node has not yet connected to the cluster.
o.a.n.w.a.c.IllegalClusterStateExceptionMapper org.apache.nifi.cluster.manager.exception.IllegalClusterStateException: The Flow Controller is initializing the Data Flow.. Returning Conflict response.
This simply tells you that the flow.json.gz is still being initialized (loaded). This process needs to complete before the node finishes startup and can join the cluster. Depending on which Apache NiFi version you are running and the size of your dataflow, this can take some time to complete.
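When going through a longer log, it can help to set aside the two messages explained above so that whatever remains stands out. The sketch below does that with plain substring matching; the marker strings are copied from the two lines quoted in this thread, the function names are hypothetical, and "expected" here only means "expected while a node is still starting up", not that the cluster is healthy.

```python
from typing import Iterable, List, Optional

# Substrings taken from the two log lines quoted above, with the meaning given in this reply.
EXPECTED_DURING_STARTUP = {
    "Cannot replicate request to Node": "the target node has not connected to the cluster yet",
    "The Flow Controller is initializing the Data Flow": "the flow.json.gz is still being loaded",
}


def explain_log_line(line: str) -> Optional[str]:
    """Return the explanation for a known startup message, or None if the line is unknown."""
    for marker, meaning in EXPECTED_DURING_STARTUP.items():
        if marker in line:
            return meaning
    return None


def lines_needing_review(lines: Iterable[str]) -> List[str]:
    """Keep only non-empty lines that are not one of the known startup messages."""
    return [ln for ln in lines if ln.strip() and explain_log_line(ln) is None]
```

Anything returned by `lines_needing_review` is what would be worth sharing from the full logs.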
What is the complete version of NiFi you are using?
Without your full logs it is not possible from what has been shared to tell you what is going on or even if there really is any corruption with your flow.json.gz.
One thing you can do is configure your NiFi to start up with all components on your canvas stopped instead of in their last known state. This can be helpful if you have recently added a new dataflow that is perhaps causing issues initializing at startup.
This is achieved by changing the following setting in the nifi.properties file. Save a backup of your flow.json.gz before starting after changing this setting. The saved flow.json.gz will have the original saved state (Running, Stopped, Disabled) of all the components.
nifi.flowcontroller.autoResumeState=false
If your NiFi cluster starts fine after making this change, you can restart your dataflows to see if any are having issues.
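The backup-and-edit step can be scripted per node. The following is a minimal sketch, assuming the node is stopped and that flow.json.gz and nifi.properties are in the same configuration directory; the function names and the `.bak` backup name are hypothetical, and only the property name comes from the suggestion above.

```python
import re
import shutil
from pathlib import Path

PROPERTY = "nifi.flowcontroller.autoResumeState"


def set_auto_resume_state(properties_text: str, value: str = "false") -> str:
    """Return the properties text with PROPERTY set to value, appending it if absent."""
    pattern = re.compile(rf"^{re.escape(PROPERTY)}\s*=.*$", flags=re.MULTILINE)
    new_line = f"{PROPERTY}={value}"
    if pattern.search(properties_text):
        # Replace the existing setting in place; other lines are left untouched.
        return pattern.sub(lambda _m: new_line, properties_text)
    # Property not present: append it on its own line.
    sep = "" if properties_text.endswith("\n") or not properties_text else "\n"
    return properties_text + sep + new_line + "\n"


def back_up_flow_and_disable_auto_resume(conf_dir: Path) -> Path:
    """Copy flow.json.gz to a backup, then set autoResumeState=false in nifi.properties."""
    flow = conf_dir / "flow.json.gz"
    backup = conf_dir / "flow.json.gz.bak"  # hypothetical backup name
    shutil.copy2(flow, backup)  # preserves the original component states
    props = conf_dir / "nifi.properties"
    props.write_text(set_auto_resume_state(props.read_text()))
    return backup
```

The backup is taken before the properties file is touched, so the original Running/Stopped/Disabled states can be restored by copying the backup over flow.json.gz and setting the property back to true.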
Beyond the above suggestion, there is not enough information shared to suggest anything else.
Please help our community grow. If you found that any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to log in and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt