Created 08-16-2024 02:16 PM
Hello,
I have a 5-node cluster, running nifi 1.19.1. Recently, I got multiple alerts saying CPU usage is too high (more than 380% in a 4-core server). When I tailed nifi-app.log, I found the node was kept trying to join the cluster. An weird thing I found was as following fragment:
2024-08-16 10:21:44,490 INFO [Process Cluster Protocol Request-15] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 8bd938b4-6e77-4353-8eef-1e00662efd82 (type=HEARTBEAT, length=5905 bytes) from mnprdlgmsyslog04.ds.cpr.ca:9443 in 145 millis
2024-08-16 10:21:44,501 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2024-08-16 10:21:41,904 and sent to mnprdlgmsyslog04.ds.cpr.ca:11443 at 2024-08-16 10:21:44,501; determining Cluster Coordinator took 105 millis; DNS lookup for coordinator took 0 millis; connecting to coordinator took 2334 millis; sending heartbeat took 146 millis; receiving first byte from response took 11 millis; receiving full response took 11 millis; total time was 2596 millis
2024-08-16 10:21:45,436 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator mnprdlgmsyslog04.ds.cpr.ca:9443 is now connected
2024-08-16 10:21:45,436 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Status of mnprdlgmsyslog04.ds.cpr.ca:9443 changed from NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTING, updateId=5092] to NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTED, updateId=5096]
2024-08-16 10:21:49,683 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for mnprdlgmsyslog04.ds.cpr.ca:9443 -- Received first heartbeat from connecting node. Node connected.
2024-08-16 10:21:50,066 INFO [Process Cluster Protocol Request-46] o.a.n.c.c.node.NodeClusterCoordinator Status of mnprdlgmsyslog04.ds.cpr.ca:9443 changed from NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTED, updateId=5096] to NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTED, updateId=5096]
2024-08-16 10:21:53,305 INFO [Process Cluster Protocol Request-6] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 1d1299f8-79eb-4406-af53-d00f8d57d858 (type=HEARTBEAT, length=5909 bytes) from mnprdlgmsyslog04.ds.cpr.ca:9443 in 113 millis
2024-08-16 10:21:53,311 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2024-08-16 10:21:49,504 and sent to mnprdlgmsyslog04.ds.cpr.ca:11443 at 2024-08-16 10:21:53,311; determining Cluster Coordinator took 7 millis; DNS lookup for coordinator took 0 millis; connecting to coordinator took 3679 millis; sending heartbeat took 109 millis; receiving first byte from response took 10 millis; receiving full response took 10 millis; total time was 3806 millis
2024-08-16 10:21:53,902 INFO [Process Cluster Protocol Request-48] o.a.n.c.c.node.NodeClusterCoordinator Status of mnprdlgmsyslog04.ds.cpr.ca:9443 changed from NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTED, updateId=5096] to NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=DISCONNECTED, Disconnect Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from node in 377 seconds, updateId=5097]
The node (mnprdlgmsyslog04) is itself the coordinator. As you can see, it created and processed the heartbeat first, then changed the node state to CONNECTED, which was expected. However, although another heartbeat was created and processed 3 seconds later (at 2024-08-16 10:21:53), the nodeconnectionstatus was changed to Disconnect and the reason was "Have not received a heartbeat from node in 377 seconds".
My questions are:
1) why second heartbeat didn't prevent state changed to DISCONNECTED?
2) where the "377 seconds" in the reason came from?
Any insights would be greatly appreciated.
Created 08-16-2024 03:27 PM
@HappyHao Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our NiFi experts @MattWho @mburgess who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,