Created 02-02-2026 04:20 AM
After an overnight shutdown/startup, the cluster becomes unstable. Parallel startup via a Step Function Map state appears to be what triggers the instability.

Recovery so far: deleted flow.xml.gz on the disconnected nodes → restart → nodes rejoined successfully.

Sequential startup with wait times (sketched in the snippet below):
1. ZK1 → 60s → ZK2 → 60s → ZK3 → 120s (quorum)
2. NiFi Registry → 90s
3. Node 1 → 180s → Node 2 → 120s → Node 3 → 120s

Sequential shutdown in reverse order.
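For concreteness, here is a rough sketch of that order as plain Python. start_service() is a hypothetical placeholder for whatever actually starts each host (in our case a Step Function / SSM task), and the hostnames and wait times simply mirror the list above.

```python
# Minimal sketch of the sequential startup above. start_service() is a
# hypothetical placeholder for the real start mechanism (Step Function task,
# SSM command, systemctl over SSH, etc.).
import time

def start_service(host: str, service: str) -> None:
    print(f"starting {service} on {host}")  # replace with real start mechanism

STARTUP_SEQUENCE = [
    ("zk1", "zookeeper", 60),
    ("zk2", "zookeeper", 60),
    ("zk3", "zookeeper", 120),           # give the ensemble time to form quorum
    ("registry1", "nifi-registry", 90),
    ("nifi-node1", "nifi", 180),
    ("nifi-node2", "nifi", 120),
    ("nifi-node3", "nifi", 120),
]

for host, service, wait_seconds in STARTUP_SEQUENCE:
    start_service(host, service)
    time.sleep(wait_seconds)

# Shutdown would walk the same list in reverse with a stop_service() equivalent.
```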
Has anyone implemented similar scheduled automation for NiFi clusters? Any guidance appreciated!
Created 02-02-2026 08:12 AM
@fy-test
Apache NiFi 2.7.2 does not use a flow.xml.gz file; that format was only used by Apache NiFi 1.x. NiFi 2.x versions use a flow.json.gz file instead.
I would suggest making sure the ZooKeeper quorum is up before starting the NiFi service. A NiFi cluster can't form, or remain formed, if ZooKeeper does not have a stable quorum.
If your NiFi nodes are disconnecting and reconnecting, I would start by looking at the status of the nodes to see what reason is being given for the disconnects. You can find this in the Cluster UI within NiFi:
Clicking on the small "i" icon to the left of the node name will open a pop-up window that shows node events. You should also see node events in the nifi-app.log on each node.
You would normally start all nodes at the same time. NiFi knows how many nodes were last in the cluster and has a flow election process that depends on all nodes connecting, so startup times will be much longer if not all nodes connect. NiFi has a configurable timer that controls how long flow election will run before startup finishes with just the nodes that have connected (see the snippet below).
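For reference, that timer lives in nifi.properties as nifi.cluster.flow.election.max.wait.time (default 5 mins), alongside nifi.cluster.flow.election.max.candidates. A quick sketch that just prints those two settings from a node, assuming a typical conf location:

```python
# Quick sketch: print the two flow-election settings from a node's nifi.properties.
# The conf path is an assumption for your install; adjust as needed.
from pathlib import Path

props = Path("/opt/nifi/conf/nifi.properties")  # assumed install location
wanted = {
    "nifi.cluster.flow.election.max.wait.time",   # how long election may run (default 5 mins)
    "nifi.cluster.flow.election.max.candidates",  # finish early once this many nodes have joined
}

for line in props.read_text().splitlines():
    key, sep, value = line.partition("=")
    if sep and key.strip() in wanted:
        print(f"{key.strip()} = {value.strip() or '<empty>'}")
```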
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 02-02-2026 08:23 AM
Thank you for the guidance!
Answers to your questions:
1. Coordinator stable (Node 2), but overloaded (90% CPU, 2.6s heartbeat latency)
2. Yes, 8+ hours between stop/start (overnight shutdown)
3. No backlog (queues nearly empty)
4. No OOM exceptions
5. No long GC pauses observed
6. Yes, coordinator logs: "no heartbeat from node in 15089 seconds" (= 4.2hr downtime)
Key issue: the Step Function starts all services in PARALLEL. ZooKeeper nodes and NiFi nodes all start together, so the NiFi nodes come up and try to connect before the ZK quorum has formed.
Solution implemented:
- Sequential ZK startup (ZK1 → ZK2 → ZK3 with waits for quorum)
- Parallel NiFi node startup (all 3 together after ZK is ready)
- Delete flow.json.gz on disconnected nodes → successful rejoin
Question: Should we clear ZooKeeper /nifi state after 8hr shutdown, or does stable quorum + parallel node startup handle stale state automatically?
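Before clearing anything we were planning to at least inspect what is under the /nifi root node (the default nifi.zookeeper.root.node). A rough sketch using the kazoo Python client; the quorum addresses and path are placeholders for our environment:

```python
# Rough sketch (kazoo client, pip install kazoo): list NiFi's znodes so we can see
# what state is actually there before deciding whether anything needs clearing.
# Quorum addresses and the /nifi root path are placeholders for our environment.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start(timeout=10)
try:
    for child in zk.get_children("/nifi"):
        data, stat = zk.get(f"/nifi/{child}")
        print(child, "modified:", stat.mtime, "size:", len(data or b""))
finally:
    zk.stop()
```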
All nodes now connected and stable. Will monitor through next scheduled cycle.
Created 02-02-2026 08:59 AM
This is an interesting line shared from your logs:
no heartbeat from node in 15089 seconds
This implies that the elected cluster coordinator disconnected a node after not receiving a heartbeat for 15089 seconds, which means the node was in a connected state. On startup of a cluster, all nodes are initially in a disconnected state until they connect, so this is not a line I would expect to see during startup. Was it accompanied by a line stating the node was being disconnected due to lack of heartbeat?
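If it helps to dig that out, something like the rough helper below will pull the heartbeat/disconnect related lines from nifi-app.log so you can see what surrounded that message. The log path is an assumption for your install and the exact wording can differ between versions.

```python
# Rough helper: print heartbeat/disconnect related lines from nifi-app.log.
# The log path is an assumption; message wording can vary between NiFi versions.
import re
from pathlib import Path

log_file = Path("/opt/nifi/logs/nifi-app.log")  # assumed log location
pattern = re.compile(r"heartbeat|disconnect", re.IGNORECASE)

for line in log_file.read_text(errors="replace").splitlines():
    if pattern.search(line):
        print(line)
```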
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 02-02-2026 11:56 AM
After extensive troubleshooting and testing, the root cause was identified: starting all 3 ZooKeeper nodes before NiFi causes cluster instability during scheduled restarts.
Startup sequence:
Result: All nodes connect cleanly, no flapping, stable cluster formation
What are your thoughts about this setup?
Created 02-02-2026 12:32 PM
@fy-test
Starting only one ZK node will not give quorum, so the NiFi cluster would not form. The NiFi nodes would come up and keep trying to reach the ZK quorum for cluster coordinator election before the cluster could form. All NiFi nodes need to learn which node is elected to that role so they know where to send heartbeats in order to form a NiFi cluster.
I'd say you have some other issues if your ZK quorum cluster is not stable when NiFi is started. My ZK is completely up with quorum when I start any of my NiFi clusters.
If quorum keeps coming and going due to some issue in your ZK, that could cause NiFi nodes to disconnect from the cluster and reconnect when quorum exists again.
The real question here is why your ZK is not coming up cleanly when you start all of the ZK hosts at the same time. I'd spend more time looking at the health of your ZK (a quick check is sketched below).
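One simple sanity check before starting NiFi is to ask each ZooKeeper server for its role with the srvr four-letter command (it needs to be allowed via 4lw.commands.whitelist on ZooKeeper 3.5+). A rough sketch with placeholder hostnames:

```python
# Rough quorum check: query each ZooKeeper server's mode via the "srvr" command.
# Requires srvr in 4lw.commands.whitelist (ZooKeeper 3.5+). Hostnames are placeholders.
import socket

def zk_mode(host: str, port: int = 2181) -> str:
    chunks = []
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"srvr")
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    reply = b"".join(chunks).decode(errors="replace")
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"

modes = {host: zk_mode(host) for host in ("zk1", "zk2", "zk3")}
print(modes)
# A healthy ensemble reports exactly one "leader" and the rest "follower".
if list(modes.values()).count("leader") != 1:
    raise SystemExit("ZooKeeper quorum is not stable yet")
```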
All you should need to do is start ZK nodes so you have quorum and then start NiFi and NiFi-Registry (order of NiFi and NiFi-Registry start does not matter).
Thanks,
Matt