
NiFi - 2 nodes of the cluster take a very long time to restart

Explorer

Hello,

In my HDF cluster of 12 NiFi nodes on version 1.8.0.3, 2 nodes take around 40 minutes to restart, while the other nodes take only about 5 minutes.

Each node has 40 cores and 20 GB of heap, and the repositories are on separate volumes (6 for the content_repository, 1 for the flowfile_repository, and 1 for the provenance_repository).

I don't see any errors in the logs.

Do you know where this could come from?

I can give you more information if needed.

Thanks for your help


Master Mentor

@Elnozy 

HDF 3.3 (based on Apache NiFi 1.8) is almost 7 years old at this point. CFM 2.1.6 is the latest Cloudera NiFi offering, based on Apache NiFi 1.23+.

There have been many improvements to NiFi over the years, including fixes that greatly improve NiFi startup time.

Your best option is to upgrade: you would mitigate many security issues, pick up bug fixes, and get the many improvements, including those that improve startup speed.

The difference in startup times is most likely directly related to the number of FlowFiles queued on some nodes versus others.
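
As a rough way to compare the queues, the NiFi REST API's cluster endpoint reports the queued count and size per node. Here is a minimal sketch, assuming an unsecured cluster and a hypothetical host name (a secured cluster would also need an authorization token in the request headers):

```python
# Sketch: compare queued FlowFiles per node via the NiFi cluster REST endpoint.
import requests

NIFI_URL = "http://nifi-host:8080"  # hypothetical; use one of your node addresses

resp = requests.get(f"{NIFI_URL}/nifi-api/controller/cluster", timeout=30)
resp.raise_for_status()

for node in resp.json()["cluster"]["nodes"]:
    # "queued" is a human-readable string such as "1,234 / 5.6 GB"
    print(f'{node["address"]}:{node["apiPort"]}  queued={node["queued"]}')
```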

But here are just a few examples of fixed issues related to slow startup:

  1. NIFI-9382
  2. NIFI-9289
  3. NIFI-7999

If you found any of the suggestions/solutions provided helped you with your issue, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Explorer

@MattWho 

Thanks for your reply.

Unfortunately, a version upgrade is not that simple, since this is a production cluster.

What I don't understand is why only two nodes pose a problem, and it is always the same two nodes.

Thanks and Best Regards

Master Mentor

@Elnozy 
Other than examining a series of spaced-out thread dumps from those two nodes during the long startups, it would be difficult to know the specific reason.
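
To capture those dumps, you can use the jstack tool from the JDK that runs NiFi (NiFi's own bin/nifi.sh dump <file> also writes a thread dump). A small sketch that takes a dump every 5 minutes, assuming a hypothetical pid:

```python
# Sketch: capture spaced-out thread dumps during a slow startup.
import subprocess
import time
from datetime import datetime

NIFI_PID = 12345        # hypothetical; take it from bin/nifi.pid or ps
INTERVAL_SECONDS = 300  # one dump every 5 minutes
DUMP_COUNT = 8          # covers roughly 40 minutes of startup

for _ in range(DUMP_COUNT):
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    with open(f"threaddump-{stamp}.txt", "w") as out:
        # jstack must come from the same JDK that runs NiFi
        subprocess.run(["jstack", "-l", str(NIFI_PID)], stdout=out, check=True)
    time.sleep(INTERVAL_SECONDS)
```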

But you could look to see whether these two nodes have a lot more queued data than the other nodes.

Are their content, flowfile, and provenance repositories a lot larger than those on the other nodes?
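
As a starting point, you could total the on-disk size of each repository on each node and compare. A minimal sketch, with hypothetical paths that you would replace with the directories configured in nifi.properties:

```python
# Sketch: total on-disk size of each NiFi repository on one node.
import os

REPOS = {
    "content_repo_1": "/data1/content_repository",   # hypothetical paths
    "flowfile_repo": "/data7/flowfile_repository",
    "provenance_repo": "/data8/provenance_repository",
}

def dir_size_bytes(path: str) -> int:
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # files may disappear while NiFi is running
    return total

for label, path in REPOS.items():
    print(f"{label}: {dir_size_bytes(path) / 1024**3:.1f} GiB")
```

Running this on each node and comparing the totals would show whether the two slow nodes are carrying noticeably more repository data.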

As far as the numerous known startup improvements made over the years go, there are no workarounds for them other than getting those improvements through an upgrade.

If you found any of the suggestions/solutions provided helped you with your issue, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Super Collaborator

If clustered, is ZooKeeper running on each node, or has that been separated? I'm wondering whether electing a new cluster coordinator or reaching an acceptable quorum is contributing to the slowness.
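
If you want to verify which ZooKeeper server currently holds which role, the "srvr" four-letter command reports each server's mode (leader or follower). A small sketch with hypothetical hostnames; note that newer ZooKeeper releases require four-letter commands to be whitelisted via 4lw.commands.whitelist in zoo.cfg:

```python
# Sketch: ask each ZooKeeper server for its role via the "srvr" command.
import socket

ZK_SERVERS = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]  # hypothetical

for host in ZK_SERVERS:
    with socket.create_connection((host, 2181), timeout=5) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    reply = b"".join(chunks).decode()
    # the reply contains a line like "Mode: leader" or "Mode: follower"
    mode = next((l for l in reply.splitlines() if l.startswith("Mode:")), "Mode: ?")
    print(f"{host}: {mode}")
```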

Explorer

Hello,

Here is the thread dump from the last restart of node 2304. We took a thread dump every 5 minutes: threaddump
The only thing I notice is "Cleanup Archive for contentX", which seems to take more than 5 minutes for some of the content repositories. I don't know whether this cleanup can be a blocking point.
And maybe I'm missing something in interpreting the thread dumps.
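
To check whether those cleanup threads persist, here is a quick sketch that counts in how many dumps each one appears (assuming the dumps are saved as threaddump-*.txt; a thread showing up in several consecutive dumps only hints at, but does not prove, a long-running cleanup):

```python
# Sketch: find "Cleanup Archive" threads that appear across consecutive dumps.
import glob
import re

# matches thread names like "Cleanup Archive for content1"
THREAD_RE = re.compile(r'^"(Cleanup Archive for [^"]+)"', re.MULTILINE)

seen_in = {}  # thread name -> dump files in which it appears
for dump_file in sorted(glob.glob("threaddump-*.txt")):
    with open(dump_file) as f:
        for name in set(THREAD_RE.findall(f.read())):
            seen_in.setdefault(name, []).append(dump_file)

for name, files in sorted(seen_in.items()):
    print(f"{name}: present in {len(files)} of the dumps")
```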

I also took some screenshots of the cluster view to check whether the two bad nodes (2304 and 2311) show more usage. Those two nodes hold about 40 GB more FlowFile data (6% usage instead of 5% for the others): Screen cluster

NiFi is clustered, and we have three ZooKeeper server nodes dedicated to NiFi.
Do you know how we can check the ZooKeeper actions, i.e. the election of the Cluster Coordinator and Primary Node roles?

Thanks for your help
Best Regards