Created 09-20-2021 05:26 AM
Hello,
We are running a single-node NiFi instance on an 8-core machine and have configured 'Maximum Timer Driven Thread Count' to 300.
Once in a while NiFi gets completely stuck (it won't accept FlowFiles) and no app log entries are written until we restart NiFi manually.
So we are wondering if this is due to NiFi thread starvation.
In the article below, the 'NiFi Thread Starvation' section mentions that the system can get stuck in a resource competition cycle.
So, if NiFi needs more threads than the cores available on the system, is there any chance of NiFi getting stuck?
Thanks,
Mahendra
Created 09-24-2021 09:24 AM
That is a possibility. The 'Maximum Timer Driven Thread Count' setting sets the size of a thread pool that the NiFi controller uses to hand out threads to dataflow components when they execute.
The general recommendation is to set this value to 2 to 4 times the number of cores present on a single NiFi instance (if you are running a NiFi cluster, this setting is applied per node, not as a max across the entire cluster). That does not mean you cannot set the thread pool much higher, as you have, but you need to do so cautiously and monitor CPU usage over extended periods of time, since your dataflows may fluctuate between periods of high and low CPU demand. It is the cycles of high CPU usage that can become problematic. In your scenario, you have 8 cores trying to service up to 300 threads for your dataflows, plus NiFi core-level threads (not part of that thread pool), plus threads belonging to any other services on the host and to the OS. So I suspect you have many threads often in CPU wait, waiting for their time on a core. You could also have a scenario where one thread is WAITING on another thread which is itself WAITING on something else. As the system cycles through all these threads, you end up with periods of time where the system appears to be hung.
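As a rough illustration (a minimal shell sketch, assuming a Linux host where the nproc command is available), you can derive that recommended range from the core count and compare it to what is currently configured:

```
# Derive the commonly recommended Timer Driven thread pool range (2x - 4x cores).
CORES=$(nproc)                                   # reports 8 on the machine described above
echo "Recommended pool size: $((CORES * 2)) - $((CORES * 4)) threads"
# -> Recommended pool size: 16 - 32 threads, versus the 300 currently configured
```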
The dataflow components you use and how they are configured, along with the volume of data, all play into the overall CPU usage and the length of time a thread is actively executing.
It is interesting that you stated all logging stops as well. The fact that all logging stops makes me wonder whether, with so many threads, some core threads get left in CPU wait so long that they impact logging.
Have you tried getting thread dumps from NiFi when it is in this hung state? Examining a series of thread dumps might help pinpoint whether you get into a state where you have threads waiting on other threads that are not progressing. You may also want to take a close look at disk IOPS for all NiFi repositories, which can affect performance in terms of how long a thread takes to complete. Also keep in mind that large dataflows and large volumes of FlowFiles can lead to a need for many open file handles, so make sure your NiFi service user has access to a LOT of file handles (999,999 for example). Your dataflows may also be spinning off a lot of processes, so make sure your NiFi service user also has a high process limit.
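If it helps, here is a rough sketch of how those diagnostics might be collected on the NiFi host (the install path /opt/nifi and the service user name nifi are placeholders for your environment, not values from your setup):

```
# 1. Thread dumps while NiFi appears hung (take several, ~30 seconds apart)
/opt/nifi/bin/nifi.sh dump /tmp/nifi-thread-dump-$(date +%s).txt
# or, with the NiFi JVM pid:
# jstack <nifi-pid> > /tmp/nifi-thread-dump-$(date +%s).txt

# 2. Disk IOPS / utilization for the volumes hosting the NiFi repositories
iostat -x 5 3

# 3. Current limits for the NiFi service user
ulimit -n        # open file handles
ulimit -u        # max user processes

# 4. Example of raising limits in /etc/security/limits.conf (values from the note above)
# nifi  soft  nofile  999999
# nifi  hard  nofile  999999
# nifi  soft  nproc   999999
# nifi  hard  nproc   999999
```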
Hope this helps give you areas to dig into for your issue,
Matt