Created 05-03-2024 03:34 AM
I have a NiFi cluster which keeps having issues with nodes getting disconnected, leaving the cluster unstable.
What should be the approach to debugging the issue?
Created 05-03-2024 08:04 AM
Also keep in mind that nifi-app.log will log node events as well, and it may help to inspect those logs to see whether any other notable events were logged around that same time. Was the node that got disconnected the currently elected primary node? (You could tell from logs on another node reporting that it was elected primary node just after the previously elected primary node was disconnected.) If that pattern is consistent, then your dataflow may heavily use "primary node" only scheduled processors while not handling FlowFile load balancing in your dataflow design(s).
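If it helps, here is a minimal sketch of that kind of log scan (my own illustration, not anything shipped with NiFi; the search strings are assumptions and the exact wording of these messages can differ between NiFi versions):

# Scan nifi-app.log for disconnect and election-related events so you can
# line up their timestamps. Search strings below are assumptions.
import re
import sys

PATTERNS = {
    "primary_election": re.compile(r"elected.*Primary Node", re.IGNORECASE),
    "coordinator_election": re.compile(r"elected.*Cluster Coordinator", re.IGNORECASE),
    "disconnect": re.compile(r"disconnect", re.IGNORECASE),
    "heartbeat": re.compile(r"heartbeat", re.IGNORECASE),
}

def scan(path):
    with open(path, errors="replace") as log:
        for line in log:
            for label, pattern in PATTERNS.items():
                if pattern.search(line):
                    print(f"{label:>22}: {line.rstrip()}")
                    break

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "nifi-app.log")

Running it against the nifi-app.log from each node and comparing the output side by side makes it easier to spot the pattern described above (a disconnect on one node followed immediately by a primary node election on another).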
Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 05-13-2024 01:11 AM
"High core load average"... what value would be considered a high core load?
In my case I mostly see it between 20 and 30.
Created 05-20-2024 02:18 PM
@manishg
How many cpu cores does each of your NiFi hosts have?
A load average of 1 means you are using 100% of 1 CPU on average.
A load average of 20 means you are using 100% of 20 cores on average.
And so on...
So let's say your node has 8 cores but your load average is higher than 8; this means your CPU is saturated and is being asked to perform more work than it can handle efficiently. That leads to long thread execution times and can interfere with heartbeats being sent in a timely manner by nodes or processed by the elected cluster coordinator.
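To put numbers on that, here is a quick check you can run on a node (just an illustration; it uses Python's os.getloadavg(), which is available on Linux and macOS):

# Compare the 1/5/15-minute load averages to the number of CPU cores on this node.
import os

cores = os.cpu_count()
load_1m, load_5m, load_15m = os.getloadavg()

print(f"cores: {cores}")
print(f"load averages (1m/5m/15m): {load_1m:.2f} / {load_5m:.2f} / {load_15m:.2f}")
if load_5m > cores:
    print("CPU is likely saturated: load exceeds core count.")
else:
    print("Load is within the available core count.")

With 8 cores and a load average of 20-30, this check would report saturation, which matches the symptoms described.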
Often this is triggered by too many concurrent tasks on high-CPU-usage processors, high FlowFile volume, etc. You can ultimately design a dataflow that simply needs more CPU than you have in order to work at the throughput you need. Users commonly just keep configuring more and more concurrent tasks and set the Max Timer Driven thread pool way too high for the number of cores available on a node. This allows more threads to execute concurrently, but it just results in each thread taking longer to complete as its time is sliced on the CPU: thread 1 gets some time on CPU 1 and then goes into a wait state while another thread gets some time, and eventually thread 1 gets a bit more time. For threads that only need milliseconds that is not a big deal, but for CPU-intensive processors it can cause issues. Let's say you have numerous CPU-intensive threads executing at the same time and the heartbeat is scheduled; that scheduled thread is now waiting in line for time on the CPU.
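As a rough starting point, a commonly cited guideline is to size the Max Timer Driven Thread Count at around 2-4x the core count per node and tune from there; the small sketch below just prints that range (my own illustration, not an official formula, so treat it as a tuning starting point rather than a rule):

# Print a commonly cited starting range for the Max Timer Driven Thread Count
# based on the core count of this node (assumed guideline: roughly 2-4x cores).
import os

cores = os.cpu_count()
print(f"cores per node: {cores}")
print(f"suggested Max Timer Driven Thread Count: {cores * 2} to {cores * 4}")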
Sometimes an alternate dataflow design can be used that uses less CPU. Sometimes you can add more nodes. Sometimes you need to move some dataflows to a different cluster. Sometimes you just need more CPU.
Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt