
Unstable cluster

Expert Contributor

I have a NiFi cluster which keeps having issues with nodes getting disconnected, and the cluster is unstable.

What should be the approach to debug the issue?


Master Mentor

@manishg 

  1. What reason is NiFi giving for the node disconnection?
    1. Go to the NiFi global menu in the upper right corner of the UI on a node that is still connected to the cluster and select "Cluster".
    2. From the new UI you will see a list of your nodes. To the left of each node you will see a small "view details" icon. Click on that for one of the nodes that experienced a disconnection (it might currently be connected). This will open a new UI containing the node events. Node disconnection and reconnection events, along with the reason, are listed here.
    3. Probably the most common unexpected disconnect reason is "lack of heartbeat". Within the nifi.properties file you can configure the node heartbeat interval (nifi.cluster.protocol.heartbeat.interval) in the cluster section. The default is 5 secs and the same value must be set on all nodes in your cluster.
    4. This setting controls how often a node will attempt to send a heartbeat to the currently elected cluster coordinator node. The elected cluster coordinator expects to receive at least one successful heartbeat from a node within 8 times the configured heartbeat interval, so with the default 5 sec interval the cluster coordinator needs to receive a heartbeat at least once every 40 seconds or it will disconnect the node (see the sketch just after this list).
  2. If the reason is lack of heartbeat, the node events will also tell you the time of that event. NiFi will automatically reconnect a node back into the cluster if a successful heartbeat is received after a disconnection due to lack of heartbeat.
  3. Things like long JVM Garbage Collection events can result in disconnects. Sometimes resolving the issue is as simple as increasing the heartbeat interval, allowing more time before a node gets disconnected (for example setting it to 15 secs on all nodes). Long garbage collection pauses can happen with large heap settings and large flows processing lots of FlowFiles. From the same Cluster UI, you can select the "JVM" tab to see basic details about the JVM on each node, including JVM GC details. This helps you identify whether GC is disproportionate amongst your nodes.
  4. CPU saturation on a node can also delay heartbeats. From the same Cluster UI you can select the "System" tab, which will tell you the number of cores you have per node and the core load average per node. A high core load average can result in longer times for any given thread to complete.
  5. Often dataflow design can contribute to core load and GC, resulting in node disconnections due to lack of heartbeat. For example, one node in your cluster is executing on far more FlowFiles than any of the other nodes. This means your design is not handling FlowFile distribution well, causing one node to do much more work.
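
To make the arithmetic in items 3 and 4 above concrete, here is a minimal Python sketch of how the disconnect window follows from the configured heartbeat interval. The 8x multiplier is the behavior described above, the property name comes from nifi.properties, and the helper function itself is purely illustrative.

    # Minimal sketch: how long the elected cluster coordinator will wait for a
    # successful heartbeat before disconnecting a node. Assumes the 8x multiplier
    # described above; verify the behavior against your NiFi version.

    def disconnect_window_seconds(heartbeat_interval_secs: float, multiplier: int = 8) -> float:
        """Seconds without a successful heartbeat before a node is disconnected."""
        return heartbeat_interval_secs * multiplier

    # nifi.cluster.protocol.heartbeat.interval = 5 sec (default; must match on all nodes)
    print(disconnect_window_seconds(5))   # 40 -> one heartbeat needed every 40 seconds
    # Raising the interval to 15 sec on all nodes gives more slack for long GC pauses:
    print(disconnect_window_seconds(15))  # 120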

Also keep in mind that the nifi-app.log will log node events as well, and it may help to inspect those logs to see if any other notable events were logged around the same time. Was the node that got disconnected the currently elected primary node? (You can tell from logs on another node reporting that it was elected as primary node just after the previously elected primary node was disconnected.) If that pattern is consistent, then your dataflow may rely heavily on "primary node" only scheduled processors and you are not handling FlowFile load balancing programmatically in your dataflow design(s).
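
If you want to correlate these events with what else was happening, here is a rough sketch of pulling node-event lines out of nifi-app.log so timestamps can be compared across nodes. The log path and the keywords are assumptions; adjust them to your install and to the actual wording in your logs.

    # Rough sketch: filter nifi-app.log for lines related to node events.
    # The path and keywords below are assumptions; adjust to your environment.
    from pathlib import Path

    LOG_FILE = Path("/opt/nifi/logs/nifi-app.log")  # hypothetical install location
    KEYWORDS = ("disconnect", "heartbeat", "primary node", "cluster coordinator")

    for line in LOG_FILE.read_text(errors="replace").splitlines():
        if any(keyword in line.lower() for keyword in KEYWORDS):
            print(line)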

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Expert Contributor
Thanks for this detailed reply.
I will investigate along these lines.

Expert Contributor

"High core load average"... what value would count as a high core load?

In my case I see it between 20-30 mostly.

Master Mentor

@manishg 
How many cpu cores does each of your NiFi hosts have?

A load average of 1 means you are using 100% of 1 CPU core on average.
A load average of 20 means you are using 100% of 20 cores on average.
etc...

So let's say your node has 8 cores but your load average is higher than 8; this means your CPU is saturated and being asked to perform more work than it can handle efficiently. This leads to long thread execution times and can interfere with timely heartbeats being sent by nodes or processed by the elected cluster coordinator.
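
As a quick sanity check of a 20-30 load average, here is a small Python sketch that compares the load averages against the core count on a node (os.getloadavg() is only available on Unix-like systems).

    # Sketch: compare 1/5/15-minute load averages against the number of cores.
    # A sustained load average well above the core count suggests CPU saturation.
    import os

    cores = os.cpu_count() or 1
    load_1m, load_5m, load_15m = os.getloadavg()  # Unix-only

    print(f"cores={cores}  load: 1m={load_1m} 5m={load_5m} 15m={load_15m}")
    if load_5m > cores:
        print("CPU is likely saturated; threads, including heartbeats, will queue for CPU time.")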

Oftentimes this is triggered by too many concurrent tasks on high-CPU-usage processors, high FlowFile volume, etc. You can ultimately design a dataflow that simply needs more CPU than you have in order to work at the throughput you need. Users commonly just keep configuring more and more concurrent tasks and set the Max Timer Driven thread pool far too high for the number of cores available on a node. This allows more threads to execute concurrently, but just results in each thread taking longer to complete as its time is sliced on the CPU: thread 1 gets some time on CPU 1, then waits while another thread gets some time, and eventually thread 1 gets a bit more time. For millisecond-long threads that is not a big deal, but for CPU-intensive processors it can cause issues. Let's say you have numerous CPU-intensive threads executing at the same time, and the heartbeat is scheduled. That scheduled thread is now waiting in line for time on the CPU.
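
As a rough illustration of sizing the Max Timer Driven Thread Count relative to the cores on a node, here is a small sketch. The 2-4x range is commonly cited community guidance rather than a hard rule, so treat it as a starting point and tune based on the observed core load average.

    # Rough illustration: a commonly cited starting point for the Max Timer Driven
    # Thread Count is roughly 2-4x the cores on a node, then adjust based on the
    # observed core load average. Guidance only, not a hard rule.
    import os

    cores = os.cpu_count() or 1
    low, high = 2 * cores, 4 * cores
    print(f"{cores} cores -> consider a Max Timer Driven Thread Count around {low}-{high}")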

Sometimes an alternate dataflow design that uses less CPU can be used. Sometimes you can add more nodes. Sometimes you need to move some dataflows to a different cluster. Sometimes you just need more CPU.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt