
Frequent Node Disconnects and Flow Synchronization Issues in NiFi 1.28.1 with Large Cluster

New Contributor

Hi Cloudera Community,

We are running Apache NiFi version 1.28.1 in a clustered setup with the following specifications:

  • Cluster Size: 6 nodes
  • Each Node: 32 vCPUs, 256 GB RAM
  • JVM Heap Memory: 192 GB (configured per node)
  • Max Timer Driven Thread Count: 192
  • Processor Count: Over 10,000 processors across the flows
  • Java Version: 11

We are experiencing the following issues:

  • Frequent node disconnections
  • Flow synchronization failures during node reconnects
  • Occasionally, policies appear empty when nodes rejoin

We have ensured the flow.xml.gz, authorizations.xml, and users.xml files are consistent across all nodes. However, the issues still persist.

Could you please advise:

  • What could be causing these frequent node disconnects and flow sync failures?
  • Is there an upper limit on the number of processors or thread count that could lead to instability?
  • Are there recommended JVM GC or NiFi tuning settings for high-core, high-memory environments?

Any insights or tuning recommendations would be greatly appreciated.

4 REPLIES

Community Manager

@Siva227 Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @MattWho and @mburgess, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Senior Community Moderator



New Contributor

nifi.properties:
nifi.cluster.protocol.heartbeat.interval=5 sec

nifi.cluster.node.protocol.threads=10
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=2 mins
nifi.cluster.node.read.timeout=2 mins
nifi.cluster.node.max.concurrent.requests=150
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=1 mins
nifi.cluster.flow.election.max.candidates=
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=12
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs

zookeeper.properties:
initLimit=10
autopurge.purgeInterval=24
syncLimit=5
tickTime=2000
dataDir=./state/zookeeper
autopurge.snapRetainCount=30
These are the properties related to NiFi and ZooKeeper.

We are seeing the errors below in the logs all the time:
Node disconnected due to Proposed Authorizer is not inheritable by the Flow Controller because NiFi has already started the dataflow and Authorizer has differences: Proposed Authorizations do not match current Authorizations: Proposed fingerprint is not inheritable because the current access policies is not empty.

Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption

Master Mentor

@Siva227 

The error you shared occurs when a node is trying to reconnect to the cluster following a disconnection, so first we need to identify why the node disconnected in the first place. I suspect the node is being disconnected due to lack of heartbeat, or because it failed to process a change request from the cluster coordinator node.

  • Cluster Size: 6 nodes
  • Each Node: 32 vCPUs, 256 GB RAM
  • JVM Heap Memory: 192 GB (configured per node)
  • Max Timer Driven Thread Count: 192
  • Processor Count: Over 10,000 processors across the flows

Is there a specific reason you configured your NiFi to use so much heap memory? Large heaps like this result in long stop-the-world Garbage Collection (GC) pauses, and those long stop-the-world events can lead to disconnections due to lack of heartbeat from that node. A common mistake is setting the heap very large simply because the node has a lot of memory. You want to use the smallest heap your dataflows actually need; GC does not kick in until heap usage reaches ~80%.
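
Heap and GC are set through the java.arg.* entries in NiFi's conf/bootstrap.conf. Purely as a sketch (the arg indexes match a stock bootstrap.conf but may differ in yours, and 32 GB is an assumed starting point rather than a sizing recommendation for your flows), a smaller heap with G1GC enabled would look something like:

bootstrap.conf:
# Keep -Xms and -Xmx equal; size to the smallest value your flows need
java.arg.2=-Xms32g
java.arg.3=-Xmx32g
# G1GC is typically shipped commented out in bootstrap.conf; enable it for larger heaps
java.arg.13=-XX:+UseG1GC

Each node needs a restart for bootstrap.conf changes to take effect.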

The below property controls heartbeat interval and lack of heartbeat disconnection:
 nifi.cluster.protocol.heartbeat.interval=5 sec

The cluster coordinator will disconnect a node due to lack of heartbeat if no heartbeat has been received for 8 times this configured value (40 seconds in this case). It is very possible you are encountering GC pauses that last longer than this. I recommend changing your heartbeat interval to 30 sec, which allows up to 4 mins of missed heartbeats before the cluster coordinator disconnects a node.
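
For reference, that is a one-line change in nifi.properties on every node (picked up on restart):

nifi.cluster.protocol.heartbeat.interval=30 sec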

The following error you shared, while not the initial cause of the node disconnection, is preventing the node from reconnecting:

Node disconnected due to Proposed Authorizer is not inheritable by the Flow Controller because NiFi has already started the dataflow and Authorizer has differences: Proposed Authorizations do not match current Authorizations: Proposed fingerprint is not inheritable because the current access policies is not empty.

This implies that there are differences between the authorizations.xml file on this node and what the cluster has in its authorizations.xml. You also state this is the error seen very often after a node disconnection?

Are you often modifying or setting up new authorization access policies around the time a node disconnects?

I'd start by identifying the initial cause of the node disconnection, which I suspect is either lack of heartbeat or a failure to replicate a request to the node, resulting in the node being disconnected. Both of these can happen with long GC pauses.
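
One way to confirm that long GC pauses line up with the disconnect times is to enable GC logging. On Java 11 that can be done with an extra java.arg.* entry in bootstrap.conf using the standard unified logging syntax; the arg index and log path below are placeholders, not NiFi defaults:

bootstrap.conf:
# Rotating GC log with wall-clock and uptime decorations
java.arg.20=-Xlog:gc*:file=/path/to/nifi/logs/gc.log:time,uptime:filecount=10,filesize=10m

You can then compare GC pause durations in that log against the disconnect timestamps in nifi-app.log.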

Please help our community grow. If any of the suggestions/solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt

 

Community Manager

@Siva227 Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.


Regards,

Diana Torres,
Senior Community Moderator

