Support Questions

HappyHao · ‎08-16-2024

Hello,

I have a 5-node cluster, running nifi 1.19.1. Recently, I got multiple alerts saying CPU usage is too high (more than 380% in a 4-core server). When I tailed nifi-app.log, I found the node was kept trying to join the cluster. An weird thing I found was as following fragment:

2024-08-16 10:21:44,490 INFO [Process Cluster Protocol Request-15] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 8bd938b4-6e77-4353-8eef-1e00662efd82 (type=HEARTBEAT, length=5905 bytes) from mnprdlgmsyslog04.ds.cpr.ca:9443 in 145 millis

2024-08-16 10:21:44,501 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2024-08-16 10:21:41,904 and sent to mnprdlgmsyslog04.ds.cpr.ca:11443 at 2024-08-16 10:21:44,501; determining Cluster Coordinator took 105 millis; DNS lookup for coordinator took 0 millis; connecting to coordinator took 2334 millis; sending heartbeat took 146 millis; receiving first byte from response took 11 millis; receiving full response took 11 millis; total time was 2596 millis

2024-08-16 10:21:45,436 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator mnprdlgmsyslog04.ds.cpr.ca:9443 is now connected

2024-08-16 10:21:45,436 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Status of mnprdlgmsyslog04.ds.cpr.ca:9443 changed from NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTING, updateId=5092] to NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTED, updateId=5096]

2024-08-16 10:21:49,683 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for mnprdlgmsyslog04.ds.cpr.ca:9443 -- Received first heartbeat from connecting node. Node connected.

2024-08-16 10:21:50,066 INFO [Process Cluster Protocol Request-46] o.a.n.c.c.node.NodeClusterCoordinator Status of mnprdlgmsyslog04.ds.cpr.ca:9443 changed from NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTED, updateId=5096] to NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTED, updateId=5096]

2024-08-16 10:21:53,305 INFO [Process Cluster Protocol Request-6] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 1d1299f8-79eb-4406-af53-d00f8d57d858 (type=HEARTBEAT, length=5909 bytes) from mnprdlgmsyslog04.ds.cpr.ca:9443 in 113 millis

2024-08-16 10:21:53,311 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2024-08-16 10:21:49,504 and sent to mnprdlgmsyslog04.ds.cpr.ca:11443 at 2024-08-16 10:21:53,311; determining Cluster Coordinator took 7 millis; DNS lookup for coordinator took 0 millis; connecting to coordinator took 3679 millis; sending heartbeat took 109 millis; receiving first byte from response took 10 millis; receiving full response took 10 millis; total time was 3806 millis

2024-08-16 10:21:53,902 INFO [Process Cluster Protocol Request-48] o.a.n.c.c.node.NodeClusterCoordinator Status of mnprdlgmsyslog04.ds.cpr.ca:9443 changed from NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=CONNECTED, updateId=5096] to NodeConnectionStatus[nodeId=mnprdlgmsyslog04.ds.cpr.ca:9443, state=DISCONNECTED, Disconnect Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from node in 377 seconds, updateId=5097]

The node (mnprdlgmsyslog04) is itself the coordinator. As you can see, it created and processed the heartbeat first, then changed the node state to CONNECTED, which was expected. However, although another heartbeat was created and processed 3 seconds later (at 2024-08-16 10:21:53), the nodeconnectionstatus was changed to Disconnect and the reason was "Have not received a heartbeat from node in 377 seconds".

My questions are:

1) why second heartbeat didn't prevent state changed to DISCONNECTED?

2) where the "377 seconds" in the reason came from?

Any insights would be greatly appreciated.

DianaTorres · ‎08-16-2024

@HappyHao Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @MattWho @mburgess who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.

Regards,

Diana Torres,
Community Moderator

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Community Guidelines
How to use the forum

Cloudera Community

Support Questions

NIFI NodeConnectionStatus error

NiFi Error Handling - Design Pattern

Nifi 2.0.0 M1 Installation error with python

NIFI SSL in Modern Versions of NiFi

Nifi DatabaseTableSchemaRegistry - PutDatabaseReco...

Decyphering error messages in Apache NiFi

NiFi Debugging Tutorial

Nifi: Parse Error for Xml file with 2 doubled tags

NiFi org.xerial.snappy.SnappyError

Nifi Installation error

Error while trying to upgrade nifi-registry from 1...