3 node cluster managed by 3 node zookeeper cluster, primary failing to startup and throwing IllegalClusterStateException

Reader

Hi

I have a 3-node Apache NiFi cluster, which is managed by a 3-node ZooKeeper cluster.

The dev cluster worked fine, except that one node frequently dropped off. Sometimes we had to restart that node manually after renaming its flow.xml.gz and flow.json.gz, after which it started up fine and reconnected to the cluster.

But today, after one node went down, it wouldn't connect back to the cluster (even after renaming the flow .gz files). Within a few minutes another node disconnected from the cluster, and the last node, which was the primary at that stage, threw a socket timeout. I manually restarted it, and now it won't start up, throwing

Invalid State Cannot replicate request to Node oooo-nifiat01.yy.xxx.local:0000 because the node is not connected
with the nifi-user.log complaining of 
o.a.n.w.a.c.IllegalClusterStateExceptionMapper org.apache.nifi.cluster.manager.exception.IllegalClusterStateException: The Flow Controller is initializing the Data Flow.. Returning Conflict response.
 
It looks like the flow.xml.gz/flow.json.gz is corrupted on the primary, and we have a lot of dev work that we cannot afford to lose. Could anyone please help with restoring the primary node? Once it's online, I can bring up the other 2 nodes.
 
Thanks
MK
 
 
2 REPLIES

Community Manager

@MK77, Welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts, @MattWho, @SAMSAL, and @Shelton , who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



Master Mentor

@MK77 

First, let's clarify the ZooKeeper (ZK) elected roles in Apache NiFi.

Primary:

  • ZK elects one node in the cluster as the "Primary" node. Processor components on the canvas configured with Execution=Primary node will only get scheduled on that elected primary node. No other nodes will schedule these processors to execute.

Cluster Coordinator:

  • ZK elects one of the nodes as the cluster coordinator. Other nodes learn which node is the elected cluster coordinator from ZK. All nodes send node heartbeats to the cluster coordinator to form the cluster.

Any node in the NiFi cluster can be assigned either or both of these roles. There is no guarantee that the same node(s) will always be assigned these roles. Even after the NiFi cluster is formed and the roles are assigned, the nodes holding these roles can change.

The flow.json.gz contains the dataflows on the canvas that are loaded on startup. The flow.xml.gz is only loaded if the flow.json.gz is missing. If NiFi loads the dataflow from the flow.xml.gz, it will generate a flow.json.gz from that flow.xml.gz.
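To make that precedence concrete, here is a minimal Python sketch of the rule as described above. It is not NiFi code: the function name flow_file_to_load is hypothetical, and it assumes both files sit in the same conf directory, which may not match your installation.

```python
from pathlib import Path
from typing import Optional


def flow_file_to_load(conf_dir: Path) -> Optional[Path]:
    """Return the flow file that would be read at startup, per the rule above.

    Hypothetical helper for illustration only; assumes both files live in
    conf_dir.
    """
    flow_json = conf_dir / "flow.json.gz"
    flow_xml = conf_dir / "flow.xml.gz"
    if flow_json.is_file():
        return flow_json  # preferred whenever it exists
    if flow_xml.is_file():
        return flow_xml   # fallback; a flow.json.gz is then generated from it
    return None           # neither file is present
```

Running this against a node's conf directory before a restart shows which of the two files that restart would actually read.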

 

Now on to your problem....

Neither of the log lines you shared points to any problem:

Invalid State Cannot replicate request to Node <node-hostname:port> because the node is not connected

This log line simply tells you that this node can't replicate a request to another node yet because it has not yet connected to the cluster.

o.a.n.w.a.c.IllegalClusterStateExceptionMapper org.apache.nifi.cluster.manager.exception.IllegalClusterStateException: The Flow Controller is initializing the Data Flow.. Returning Conflict response.

This simply tells you that the flow.json.gz is still being initialized (loaded). This process needs to complete before the node finishes startup and can join the cluster. Depending on which Apache NiFi version you are running and the size of your dataflow, this can take some time to complete.

What is the complete version of NiFi you are using?

Without your full logs, it is not possible to tell from what has been shared what is going on, or even whether there really is any corruption in your flow.json.gz.
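While gathering the full logs, one cheap sanity check is to see whether the file still decompresses and parses at all. The Python sketch below assumes only that flow.json.gz is gzip-compressed JSON; the function name check_flow_json is hypothetical. A result of "ok" rules out damage at the file level but does not prove that NiFi will load the flow.

```python
import gzip
import json
import zlib
from pathlib import Path


def check_flow_json(path: Path) -> str:
    """Report whether path is a readable gzip file holding parseable JSON."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            json.load(fh)  # decompresses and parses the whole file
    except (OSError, EOFError, zlib.error) as exc:
        # Not a gzip file, truncated, or damaged compressed data.
        return f"gzip layer unreadable: {exc}"
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        # Decompressed fine, but the content is not valid UTF-8 JSON.
        return f"JSON payload unreadable: {exc}"
    return "ok"
```

It is safest to run this against a copy of the file rather than the one in the live conf directory.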

One thing you can do is configure your NiFi to start up with all components on your canvas stopped instead of in their last known state. This can be helpful if you have recently added a new dataflow that is perhaps causing issues initializing at startup.

This is achieved by changing the following setting in the nifi.properties file. Save a backup of your flow.json.gz before starting NiFi after changing this setting. The saved flow.json.gz will have the original saved state (Running, Stopped, Disabled) of all the components.

nifi.flowcontroller.autoResumeState=false
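The backup and the property change can be scripted together. The following Python sketch is one possible way to do it, under these assumptions: NiFi is stopped on the node, and flow.json.gz and nifi.properties both live in the conf directory you pass in. The function name disable_auto_resume and the backup file naming are hypothetical.

```python
import re
import shutil
import time
from pathlib import Path

PROP = "nifi.flowcontroller.autoResumeState"


def disable_auto_resume(conf_dir: Path) -> Path:
    """Back up flow.json.gz, then set autoResumeState=false in nifi.properties.

    Assumes both files are in conf_dir and that NiFi is not running.
    Returns the path of the backup copy.
    """
    flow = conf_dir / "flow.json.gz"
    props = conf_dir / "nifi.properties"

    # Timestamped copy, so the original component states stay recoverable.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = flow.with_name(f"flow.json.gz.{stamp}.bak")
    shutil.copy2(flow, backup)

    # latin-1 maps every byte to one character, so unrelated lines survive
    # the round trip unchanged.
    text = props.read_text(encoding="latin-1")
    pattern = re.compile(rf"^{re.escape(PROP)}\s*=.*$", re.MULTILINE)
    if pattern.search(text):
        text = pattern.sub(f"{PROP}=false", text)
    else:
        # Property not present: append it on its own line.
        text = text.rstrip("\n") + f"\n{PROP}=false\n"
    props.write_text(text, encoding="latin-1")
    return backup
```

To restore the original behavior later, set the property back to true; the backup copy still holds the original Running/Stopped/Disabled states.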

 

If your NiFi cluster starts fine after making this change, you can restart your dataflows to see if any are having issues.

Beyond the above suggestion, there is not enough information shared to suggest anything else.

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt