
NiFi node disconnection | missable heartbeat window

New Contributor

Hi,

 

We are on nifi-1.11.4, on a 3 node cluster, and we are observing frequent node disconnections.

 

..Disconnect Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from node in 42 seconds, updateId=62]

 

While we investigate the root cause of the missed heartbeats, we tried to raise the disconnection threshold from the default 40 seconds [nifi.cluster.protocol.heartbeat.missable.max=8 X nifi.cluster.protocol.heartbeat.interval=5 sec] to 300 seconds. However, even after applying nifi.cluster.protocol.heartbeat.missable.max=60 on all nodes, we still see nodes disconnected after 40 seconds. Any thoughts on why this setting is not getting picked up?

 

nifi.cluster.protocol.heartbeat.interval=5 sec
nifi.cluster.protocol.heartbeat.missable.max=60
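
The arithmetic behind these two properties can be sketched as follows (a minimal illustration of the numbers quoted above, not NiFi code):

```python
# Hypothetical helper illustrating how the disconnect threshold is
# derived from the two nifi.properties values discussed in this thread.
def disconnect_threshold_seconds(interval_sec: int, missable_max: int) -> int:
    """Seconds without a heartbeat before a node is marked disconnected."""
    return interval_sec * missable_max

# Default: 5 sec interval x 8 missable heartbeats = 40 seconds
assert disconnect_threshold_seconds(5, 8) == 40

# Intended after the change: 5 sec x 60 = 300 seconds
assert disconnect_threshold_seconds(5, 60) == 300
```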

 

Thank you

-MT13

1 ACCEPTED SOLUTION

Contributor

@mt13  Looking at this

It seems the property nifi.cluster.protocol.heartbeat.missable.max was only made configurable in NiFi 1.12; before that release it is fixed at the default value of 8, which is why you are not getting the expected timeframe.

But rather than tuning the timeouts to higher values, you should look at tuning the dataflow design. Lack of heartbeat is mostly caused by high resource utilization, whether memory, CPU, or network. The choice of processors and the incoming data should be analyzed to achieve a more stable cluster.

In NiFi 1.12 a new feature was added to review the node status history from NiFi UI > Global Menu > Node Status History. It shows the resource utilization of each node in graphical form, which is a great way to analyze the load, incoming traffic, heap utilization, etc.


2 REPLIES 2


New Contributor

@ashinde thank you very much for the response. That explains why the setting is not working for us on 1.11.4. I will have to check when we can move to 1.12 to use some of the additional features that would help with the troubleshooting.

 

Point taken about finding the actual root cause of the heartbeat delays. CPU and memory (GC, heap usage) have not been an issue so far in our investigation. We are not running any high-volume processing here, and the cluster has 3 nodes, each with 24 cores and a 22 GB heap allocation. Network is one thing we still need to check further, but so far basic analysis does not show any delay between the nodes or any packet loss.

 

For now we have stopped the frequent node disconnections due to missed heartbeats by increasing nifi.cluster.protocol.heartbeat.interval to 30 sec, while we investigate the issue further.
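
For reference, this workaround amounts to the following change in nifi.properties on all nodes (a sketch; on 1.11.4 the missable.max stays at its built-in value of 8, so the effective threshold becomes 30 x 8 = 240 seconds):

```properties
# nifi.properties (all nodes) -- workaround on NiFi 1.11.4, where
# nifi.cluster.protocol.heartbeat.missable.max is fixed at 8
nifi.cluster.protocol.heartbeat.interval=30 sec
# effective disconnect threshold: 30 sec x 8 = 240 seconds
```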

 

Best,

MT13