Created 11-18-2021 09:14 AM
Hi,
We are running nifi-1.11.4 on a 3-node cluster, and we are observing frequent node disconnections.
..Disconnect Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from node in 42 seconds, updateId=62]
While we investigate the root cause of the missed heartbeats, we tried to raise the disconnection threshold from the default 40 seconds [nifi.cluster.protocol.heartbeat.missable.max=8 x nifi.cluster.protocol.heartbeat.interval=5 sec] to 300 seconds. However, even after applying nifi.cluster.protocol.heartbeat.missable.max=60 on all nodes, we still see node disconnections occurring after 40 seconds. Any thoughts on why this setting is not being picked up?
nifi.cluster.protocol.heartbeat.interval=5 sec
nifi.cluster.protocol.heartbeat.missable.max=60
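For clarity, the 300-second target is just the product of those two settings. A minimal sketch of the arithmetic (illustrative only; NiFi derives this internally from nifi.properties):

# Sketch: disconnect window = heartbeat interval x missable.max.
def disconnect_window(interval_secs, missable_max):
    """Seconds a node can go without a heartbeat before it is disconnected."""
    return interval_secs * missable_max

print(disconnect_window(5, 8))   # defaults: 40 seconds, matching the log message above
print(disconnect_window(5, 60))  # intended: 300 seconds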
Thank you
-MT13
Created 11-22-2021 08:37 AM
It seems the property nifi.cluster.protocol.heartbeat.missable.max was only made configurable in NiFi 1.12; before that, NiFi always uses the hard-coded default of 8, which is why you are not getting the expected timeframe.
But instead of tuning the timeouts to higher values, you should look at tuning the dataflow design. A lack of heartbeat mostly occurs due to high resource utilization, whether memory, CPU, or network. The choice of processors and the incoming data should be analyzed for a more stable cluster.
In NiFi 1.12 a new feature was added to review node status history from NiFi UI > Global Menu > Node Status History. It shows the resource utilization of each node in graphical form, which is a great indicator for analyzing load, incoming traffic, heap utilization, etc.
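While you investigate, one way to see how often this is happening is to scan nifi-app.log for the disconnect message you quoted. A rough sketch, assuming the default logs/nifi-app.log location and the standard timestamp prefix on each log line; adjust both for your install:

# Sketch: count "Lack of Heartbeat" disconnects per hour in nifi-app.log.
from collections import Counter

counts = Counter()
with open("logs/nifi-app.log", encoding="utf-8") as log:
    for line in log:
        if "Disconnect Code=Lack of Heartbeat" in line:
            # Log lines begin with a timestamp like "2021-11-18 09:14:02,123";
            # the first 13 characters bucket events by date + hour.
            counts[line[:13]] += 1

for hour, n in sorted(counts.items()):
    print(hour, n)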
Created 11-22-2021 09:26 AM
@ashinde, thank you very much for the response. That explains why the setting is not working for us on 1.11.4. I will have to check when we can move to 1.12 to use some of the additional features that would help with the troubleshooting.
Point taken about finding the actual root cause of the heartbeat delay. CPU and memory (GC, heap usage) have not been an issue so far in our investigation. We are not really running any high-volume processing here, and the cluster has 3 nodes, each with 24 cores and a 22 GB heap allocation. Network is one thing we still need to check, but so far basic analysis does not show any delay between the nodes or any packet loss.
For now we have stopped the frequent node disconnections due to missed heartbeats by increasing nifi.cluster.protocol.heartbeat.interval=30 sec while we investigate the issue further.
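For anyone hitting the same issue: with the hard-coded missable.max of 8 on 1.11.4, the 30-second interval works out to a 240-second disconnect window. Same illustrative arithmetic as above:

# Stop-gap on 1.11.4: 30 sec interval x hard-coded missable.max of 8.
interval_secs = 30  # nifi.cluster.protocol.heartbeat.interval
missable_max = 8    # default; not configurable before NiFi 1.12
print(interval_secs * missable_max)  # 240-second disconnect window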
Best,
MT13