Created 01-03-2024 07:52 AM
Hello,
In my HDF cluster of 12 NiFi nodes (version 1.8.0.3), 2 nodes take around 40 minutes to restart. The other nodes take only 5 minutes.
Each node has 40 cores and 20 GB of heap, and we use separate volumes (6 for content_repository, 1 for flowfile_repository and 1 for provenance_repository).
I don't see any errors in logs.
Do you know where this could come from?
I can give you more information if needed.
Thanks for your help
Created 01-03-2024 08:53 AM
@Elnozy
HDF 3.3 (based on Apache NiFi 1.8) is more than 5 years old at this point. CFM 2.1.6 is the latest Cloudera NiFi offering and is based on Apache NiFi 1.23+.
There have been many improvements to NiFi over the years including fixes that greatly improve NiFi startup time.
Your best option is to upgrade, which gets you many security fixes, bug fixes, and other improvements, including those that improve startup speed.
The difference in startup times is most likely related to the number of FlowFiles queued on some nodes versus others.
But, here are just some examples of fixed issues related to slow startup:
If you found any of the suggestions/solutions provided helped you with your issue, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 01-04-2024 12:08 AM
Thanks for your reply.
Unfortunately, a version upgrade is not that simple because this is a production cluster.
What I don't understand is that only two nodes pose a problem and they are always the same 2 nodes.
Thanks and Best Regards
Created 01-04-2024 07:04 AM
@Elnozy
Short of examining a series of spaced-out thread dumps taken from those two nodes during the long startups, it would be difficult to know the specific reason.
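As a rough illustration of collecting such a series, here is a minimal Python sketch that calls `bin/nifi.sh dump <file>` at a fixed interval. The install path, output directory, dump count, and interval are all placeholders (assumptions), and a dump request may fail very early in startup, so the sketch simply keeps going.

```python
import subprocess
import time
from pathlib import Path


def dump_command(nifi_home, out_file):
    """Build the `nifi.sh dump <file>` command line for one thread dump."""
    return [str(Path(nifi_home) / "bin" / "nifi.sh"), "dump", str(out_file)]


def collect_dumps(nifi_home, out_dir, count=8, interval_s=300,
                  run=subprocess.run, sleep=time.sleep):
    """Request `count` thread dumps, `interval_s` seconds apart.

    Returns the list of output files that were requested.
    `run` and `sleep` are injectable so the loop can be dry-run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = []
    for i in range(count):
        out_file = out_dir / f"threaddump-{i:02d}.txt"
        # check=False: a failed dump should not stop the series.
        run(dump_command(nifi_home, out_file), check=False)
        files.append(out_file)
        if i < count - 1:
            sleep(interval_s)
    return files


# Example call (placeholder paths, adjust to your install):
# collect_dumps("/opt/nifi", "/tmp/nifi-dumps", count=8, interval_s=300)
```

Start the loop right after restarting the node so the dumps cover the whole slow startup window.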
But you could check whether these two nodes have a lot more queued data than the other nodes.
Are their content, flowfile, and provenance repositories a lot larger than those on the other nodes?
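One way to make that comparison concrete is to total the bytes and file count under each repository directory on every node. The sketch below is only an illustration; the repository paths in the example call are hypothetical and must be replaced with the locations configured in your nifi.properties.

```python
import os


def dir_usage(root):
    """Return (total_bytes, file_count) for everything under `root`."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total_bytes += os.lstat(os.path.join(dirpath, name)).st_size
                file_count += 1
            except OSError:
                # The file was removed while we were scanning; skip it.
                pass
    return total_bytes, file_count


def report(repos):
    """Map a repository label to its (total_bytes, file_count)."""
    return {label: dir_usage(path) for label, path in repos.items()}


# Example call (hypothetical mount points):
# print(report({
#     "content1": "/data1/content_repository",
#     "flowfile": "/data7/flowfile_repository",
#     "provenance": "/data8/provenance_repository",
# }))
```

Run it on a slow node and a fast node and compare the numbers side by side. A large difference in file count can matter as much as a difference in bytes.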
As for the numerous known startup improvements made over the years, there are no workarounds for them other than getting those improvements through an upgrade.
Thank you,
Matt
Created 01-05-2024 01:43 PM
If the nodes are clustered, is ZooKeeper running on each node, or is it hosted separately? I am wondering whether electing a new master or reaching an acceptable quorum is contributing to the slowness.
Created 01-10-2024 06:00 AM
Hello,
Here is the thread dump from the last restart of node 2304. We took a thread dump every 5 minutes: threaddump
The only thing I notice is "Cleanup Archive for contentX", which seems to take more than 5 minutes for some content repositories. I don't know whether this cleanup can be a blocking point.
I may also be missing something in my interpretation of the thread dumps.
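To cross-check that observation, a small script can list the threads whose name appears in every dump of the series. This is only a sketch: it assumes each thread block starts with the thread name in double quotes at the beginning of a line, followed by indented `at ...` stack frames, and the function names are my own.

```python
import re

# A thread block is assumed to start with the quoted thread name.
THREAD_HEADER = re.compile(r'^"([^"]+)"')


def parse_threads(dump_text):
    """Map thread name -> top stack frame (or None) for one dump.

    If two threads share a name, the later one wins.
    """
    threads = {}
    current = None
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            current = match.group(1)
            threads[current] = None
        elif current is not None and threads[current] is None:
            stripped = line.strip()
            if stripped.startswith("at "):
                threads[current] = stripped[3:]
    return threads


def seen_in_every_dump(dumps, name_contains):
    """Names containing `name_contains` that occur in all dumps."""
    parsed = [parse_threads(text) for text in dumps]
    if not parsed:
        return []
    common = set(parsed[0])
    for threads in parsed[1:]:
        common &= set(threads)
    return sorted(name for name in common if name_contains in name)


# Example call, with dump_texts being the file contents in time order:
# print(seen_in_every_dump(dump_texts, "Cleanup Archive"))
```

A thread that shows up in consecutive dumps taken 5 minutes apart has been alive for at least that long, and comparing its top frame between dumps via `parse_threads` hints at whether it is still making progress.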
I also took some screenshots of the cluster view to check whether the 2 bad nodes (2304 and 2311) are used more heavily. Those 2 nodes hold about 40 GB more FlowFiles (6% usage instead of 5% for the others): Screen cluster
NiFi is clustered, and we have three ZooKeeper server nodes dedicated to NiFi.
Do you know how we can check the ZooKeeper actions, such as the election of the Cluster Coordinator and Primary Node roles?
Thanks for your help
Best Regards