Created 04-17-2023 06:36 PM
Hello NiFi Community!
I have a three-node NiFi cluster that I use to ingest data from multiple source systems. From time to time, flowfiles get stuck in round-robin load-balanced queues and just sit there idle.
I've also tried adding funnels just to test if they can still proceed downstream.
Upon monitoring the disk usage of the cluster, it never exceeds 30% utilization.
Does anyone know what's causing this issue? I checked the nifi-app logs but found nothing useful.
Thank you!
Created 04-17-2023 06:43 PM
Also note that all of the stuck flowfiles are on the same node. Restarting that node typically solves the issue, but I want to prevent it from happening again.
Created 04-18-2023 01:17 AM
@databoi, Welcome to our community! To help you get the best possible answer, I have tagged in our NiFi experts @cotopaul @SAMSAL @MattWho who may be able to assist you further.
Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur,
Created 04-18-2023 02:32 AM
hi @databoi,
It would help if you could also provide your NiFi version, as each version has its own quirks and glitches.
What you have experienced so far can have many root causes and is not easy to debug 😞 I assume this always happens on the same single node, right? Something similar happened to me as well and it was not easy to fix ... or at least it was not for me.
My problem was mostly related to how I configured the NiFi cluster. I have been told that there are some best practices when it comes to configuring NiFi, especially on a bare metal machine.
There were three problems on my side and the solutions were as follows:
- I moved the repositories to a different drive (an SSD) with high I/O throughput, so NiFi could read and write the content faster.
- I increased the open files limit to 50000 and the max user processes limit to 10000. I will increase them again in a couple of days.
- My third problem was related to the disk hardware: the disk was dying and started to malfunction, causing stop-the-world delays. I replaced it and everything went back to normal.
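To verify the second point on the affected node, it helps to look at the limits the running NiFi JVM actually has, since they can differ from what your shell reports. Below is a minimal Python sketch, assuming a Linux host where /proc/<pid>/limits is readable. The function names and the WANTED thresholds are only illustrative: the numbers are simply the values from the list above, not official minimums.

```python
import re

# Values quoted in the post above; one poster's settings, not official minimums.
WANTED = {"Max open files": 50000, "Max processes": 10000}


def parse_limits(text):
    """Map each limit name in /proc/<pid>/limits text to its (soft, hard) pair."""
    limits = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        # Columns are padded with runs of spaces; limit names contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def below_wanted(limits, wanted=WANTED):
    """Return {name: soft limit} for limits whose soft value is under the wanted value."""
    low = {}
    for name, minimum in wanted.items():
        if name not in limits:
            continue
        soft = limits[name][0]
        if soft != "unlimited" and int(soft) < minimum:
            low[name] = int(soft)
    return low


def read_limits(pid):
    """Read the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_limits(handle.read())
```

Calling below_wanted(read_limits(pid)) with the PID of the NiFi Java process returns an empty dict when both soft limits meet the thresholds, and otherwise lists the ones that fall short.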
You should also pay attention to the JVM memory of that particular node. In addition, you could enable debug logging and generate some thread dumps to analyze further (./nifi.sh dump <name of your dump file>). Another thing you could check is the processes on your affected node. Maybe something is causing NiFi to become a zombie process (or you have some zombie processes) and that is affecting your overall performance.
I do hope that something in this message leads you to your root cause. In any case, I strongly recommend that you also consider the opinions of other community members with far more experience than I have.
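For the zombie process check, here is a small Python sketch, assuming a Linux node with the standard ps utility. The helper names are my own. It flags every process whose state column starts with Z and reports the parent PID, which tells you which process has not reaped its child.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, ppid, command) for every process whose state starts with Z."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)  # pid, ppid, stat, command (may contain spaces)
        if len(parts) == 4 and parts[2].startswith("Z"):
            zombies.append((int(parts[0]), int(parts[1]), parts[3].strip()))
    return zombies


def list_zombies():
    """Run ps on the local machine and report zombie processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_zombies(result.stdout)
```

Run list_zombies() on the affected node. An empty list means no zombies were present at that moment, so it is worth repeating the check while the flowfiles are actually stuck.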
Created 04-24-2023 08:12 AM
@databoi
I see from your images that you are using Apache NiFi 1.11.4, which is around the time that the load-balanced connection capability was introduced. Many bugs were subsequently identified in load-balanced connections and addressed in later releases. I strongly encourage you to upgrade to the latest NiFi release and see if your issue persists.
If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click Accept as Solution below each response that helped.
Thank you,
Matt