Support Questions


NiFi 2.7.2 Cluster Instability After Scheduled Stop/Start - Seeking Best Practices

Explorer

Environment

  • NiFi 2.7.2, 3-node cluster (3 EC2 instances, 3 Auto Scaling Groups)
  • ZooKeeper: 3-node ensemble (ECS)
  • NiFi Registry (ECS)
  • AWS Step Functions for scheduled stop/start (cost optimization)

Problem

After an overnight shutdown/startup, the cluster becomes unstable:

  • Nodes rapidly flap: CONNECTED → DISCONNECTED → CONNECTING (every 2-3 seconds).
  • Error: "Have not received heartbeat from node".
  • Critical: Disconnected node logs show NO errors/warnings - node appears healthy while coordinator reports it as disconnected.

Root Cause

Parallel startup via Step Function Map state causes:

  1. NiFi nodes start before the ZooKeeper quorum forms
  2. All 3 nodes start simultaneously → chaotic coordinator election

Resolution

Deleted flow.xml.gz on disconnected nodes → restart → nodes rejoined successfully

Proposed Solution

Sequential startup with proper wait times:

 
  1. ZK1 → 60s → ZK2 → 60s → ZK3 → 120s (quorum)
  2. NiFi Registry → 90s
  3. Node 1 → 180s → Node 2 → 120s → Node 3 → 120s

Sequential shutdown (reverse order)
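
For illustration, a rough sketch of how the staged startup above could be scripted with boto3 instead of a parallel Step Functions Map state. The instance IDs, region, and settle times are placeholders, not values from this environment:

```python
import time

import boto3

# Hypothetical instance IDs and region -- placeholders, not values from this environment.
ZK_NODES = ["i-0zk1example", "i-0zk2example", "i-0zk3example"]
NIFI_NODES = ["i-0nifi1example", "i-0nifi2example", "i-0nifi3example"]

ec2 = boto3.client("ec2", region_name="eu-west-1")


def start_and_wait(instance_id: str, settle_seconds: int) -> None:
    """Start one instance, wait until EC2 reports it running, then allow settle time."""
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    time.sleep(settle_seconds)  # crude fixed wait; a real health check would be better


# ZooKeeper first, one node at a time, then extra time for the quorum to settle.
for zk in ZK_NODES:
    start_and_wait(zk, 60)
time.sleep(120)

# NiFi Registry start omitted for brevity; NiFi nodes only after the ensemble is up.
for nifi in NIFI_NODES:
    start_and_wait(nifi, 120)
```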

Questions

  1. Official guidance on startup sequencing for multi-node clusters with external ZooKeeper?
  2. Should ZooKeeper state be cleared during scheduled shutdowns?
  3. Why don't disconnected node logs show any issues? Node appears unaware of disconnection.
  4. Recommended wait times between service starts?
  5. Best practices for scheduled start/stop on auto-scaling infrastructure?

Setup Details

  • 32GB RAM, 20GB heap, G1GC
  • Java 21 Amazon Corretto
  • Time sync verified (chrony < 1μs drift)
  • Network healthy, no packet loss

Has anyone implemented similar scheduled automation for NiFi clusters? Any guidance appreciated!

5 REPLIES

Master Mentor

@fy-test 

Apache NiFi 2.7.2 does not use a flow.xml.gz file (that format was only used by Apache NiFi 1.x versions). Apache NiFi 2.x versions use the flow.json.gz format.

I would suggest making sure the ZooKeeper quorum is up before starting the NiFi service. A NiFi cluster can't form, or stay formed, if ZooKeeper does not have a stable quorum.
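
One way to enforce that ordering is to poll the ensemble before launching NiFi. A minimal sketch using ZooKeeper's "srvr" four-letter command; the hosts are placeholders, and the command must be whitelisted via 4lw.commands.whitelist on ZooKeeper 3.5+:

```python
import socket

# Placeholder hosts; "srvr" must be whitelisted via 4lw.commands.whitelist on the ensemble.
ZK_HOSTS = [("zk1.internal", 2181), ("zk2.internal", 2181), ("zk3.internal", 2181)]


def server_mode(host: str, port: int) -> str:
    """Return the Mode reported by 'srvr' (leader/follower/standalone), or '' on error."""
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(b"srvr")
            reply = sock.recv(4096).decode(errors="replace")
    except OSError:
        return ""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return ""


modes = [server_mode(host, port) for host, port in ZK_HOSTS]
print("modes:", modes)
# A leader is only elected once quorum exists, so its presence is a reasonable readiness check.
print("quorum looks up:", "leader" in modes)
```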

If your NiFi nodes are disconnecting and reconnecting, I would start by looking at the status of the nodes to see what reason is being given for the disconnects.   You can find this in the Cluster UI within NiFi:

[Screenshot: node events pop-up in the NiFi Cluster UI]

Clicking on the small "i" icon to the left of the node name will open the pop-up window above that shows node events. You should also see node events in the nifi-app.log on each node.

  1. Do you see your elected cluster coordinator constantly changing?
  2. Was there a duration of time between stop and start?
  3. Was there a large influx of backlogged data when NiFi was started?
  4. Encounter any OutOfMemory exceptions?
  5. Encounter any long garbage collection events?
  6. Did nifi-app.log on the elected cluster coordinator(s) report any nodes being disconnected due to lack of heartbeat?

You would normally start all nodes at the same time. NiFi knows how many nodes were last in the cluster and has a flow election process that depends on all nodes connecting, so startup will take much longer if not all nodes connect. NiFi has a configurable timer for how long flow election will run before it finishes startup with just the nodes that connected.
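
For reference, that timer is controlled in nifi.properties by nifi.cluster.flow.election.max.wait.time (and optionally nifi.cluster.flow.election.max.candidates). A small sketch to report what a node is currently configured with; the conf path is an assumption:

```python
from pathlib import Path

# Path is an assumption; point this at the node's actual nifi.properties.
props_path = Path("/opt/nifi/conf/nifi.properties")

wanted = {
    "nifi.cluster.flow.election.max.wait.time",
    "nifi.cluster.flow.election.max.candidates",
}

for raw in props_path.read_text().splitlines():
    key, _, value = raw.partition("=")
    if key.strip() in wanted:
        print(f"{key.strip()} = {value.strip() or '<unset>'}")
```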

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

 

Explorer

Thank you for the guidance!
Answers to your questions:

1. Coordinator stable (Node 2), but overloaded (90% CPU, 2.6s heartbeat latency)
2. Yes, 8+ hours between stop/start (overnight shutdown)
3. No backlog (queues nearly empty)
4. No OOM exceptions
5. No long GC pauses observed
6. Yes, coordinator logs: "no heartbeat from node in 15089 seconds" (= 4.2hr downtime)

Key issue: Step Function starts all services in PARALLEL. ZooKeeper nodes and NiFi nodes all start together, so NiFi starts trying to connect before the ZK quorum has formed.

Solution implemented:
- Sequential ZK startup (ZK1 → ZK2 → ZK3 with waits for quorum)
- Parallel NiFi node startup (all 3 together after ZK is ready)
- Delete flow.json.gz on disconnected nodes → successful rejoin

Question: Should we clear ZooKeeper /nifi state after 8hr shutdown, or does stable quorum + parallel node startup handle stale state automatically?

All nodes now connected and stable. Will monitor through next scheduled cycle.

Master Mentor

@fy-test 

  1. I would not expect this step to be necessary:
    Delete flow.json.gz on disconnected nodes → successful rejoin
    - Flow election happens during startup. Once a flow is elected, nodes that join afterwards will inherit the cluster flow if their local flow does not match. 
  2. I see no need to clear ZK state. ZK elects a cluster coordinator and primary node from the nodes that establish a connection with ZK. ZK is also used by components in your dataflows that utilize cluster state, so clearing ZK could result in duplicate data processing, depending on what your flow does in NiFi. (A read-only way to inspect that state is sketched below.)
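
If you want to verify what is in ZK rather than clear it, a minimal read-only sketch using the kazoo client; the connect string is a placeholder, and /nifi assumes the default nifi.zookeeper.root.node:

```python
from kazoo.client import KazooClient  # pip install kazoo

# Connect string is a placeholder; /nifi assumes the default nifi.zookeeper.root.node.
zk = KazooClient(hosts="zk1.internal:2181,zk2.internal:2181,zk3.internal:2181")
zk.start()
try:
    # List the znodes NiFi keeps under its root (leader election, component state, etc.).
    for child in sorted(zk.get_children("/nifi")):
        print("/nifi/" + child)
finally:
    zk.stop()
```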

This is an interesting line shared from your logs:

no heartbeat from node in 15089 seconds

This implies that the elected cluster coordinator disconnected a node after not receiving a heartbeat for 15089 seconds, which means the node was in a connected state. On startup of a cluster, all nodes are initially in a disconnected state until they connect, so this is not a line I would expect to see during startup. Was it accompanied by a line stating that the node was being disconnected due to lack of heartbeat?

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Explorer

After extensive troubleshooting and testing, the root cause was identified:

Starting all 3 ZooKeeper nodes before NiFi causes cluster instability during scheduled restarts.

Startup sequence:

  1. Start single ZooKeeper node (ZK1)
  2. Start NiFi Registry (2 min wait)
  3. Start all 3 NiFi nodes in parallel (3 min wait)
  4. Add remaining ZooKeeper nodes (ZK2, ZK3) to complete ensemble

Result: All nodes connect cleanly, no flapping, stable cluster formation

What are your thoughts about this setup?

Master Mentor

@fy-test 

Starting only one ZK node will not give you quorum, so the NiFi cluster would not form. The NiFi nodes would come up and keep trying to reach the ZK quorum for cluster coordinator election before the cluster could form. All NiFi nodes need to learn which node is elected to that role so they know where to send heartbeats in order to form a NiFi cluster.
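
For reference, the standard ZooKeeper strict-majority rule makes that point concrete (nothing here is environment-specific):

```python
def zk_quorum_minimum(ensemble_size: int) -> int:
    """Minimum number of ZooKeeper servers that must be up to hold quorum (strict majority)."""
    return ensemble_size // 2 + 1

# A 3-server ensemble needs 2 servers up, so ZK1 alone cannot provide quorum.
print(zk_quorum_minimum(3))  # -> 2
```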

I'd say you have some other issue if your ZK ensemble is not stable when NiFi is started. My ZK is completely up with quorum when I start any of my NiFi clusters.

If quorum keeps coming and going due to some issue in your ZK, that could cause NiFi nodes to disconnect from the cluster and reconnect when quorum exists again.

The real question here is why your ZK is not coming up cleanly when you start all of the ZK hosts at the same time. I'd spend more time looking at the health of your ZK.

All you should need to do is start ZK nodes so you have quorum and then start NiFi and NiFi-Registry (order of NiFi and NiFi-Registry start does not matter).  

Thanks,
Matt