NiFi node keeps disconnecting from the cluster

Contributor

Hi,

We have a 2-node NiFi production cluster running the HDF-2.0.1.0 release. It has worked fine for over a year. This morning both nodes keep cycling: they connect to the cluster, show as connected, then get disconnected due to lack of heartbeat. nifi.cluster.protocol.heartbeat.interval in nifi.properties is at the default of 5 sec. In the NiFi node log I do not see heartbeats being created every 5 seconds - in my working dev cluster they are created roughly every 5 seconds, but in this production cluster they are created only every 1 or 2 minutes.

2017-05-30 10:30:07,838 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:30:07,653 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:30:07,838; send took 184 millis
2017-05-30 10:31:14,986 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:31:14,515 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:31:14,986; send took 471 millis
2017-05-30 10:33:44,971 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:33:44,404 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:33:44,971; send took 566 millis
2017-05-30 10:34:15,280 INFO [Clustering Tasks Thread-3] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:34:15,122 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:34:15,280; send took 157 millis
2017-05-30 10:36:21,204 INFO [Clustering Tasks Thread-3] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:36:20,673 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:36:21,204; send took 530 millis
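
For reference, the configured interval can be confirmed straight from nifi.properties; a minimal check, assuming an /opt/nifi install path (adjust for your environment):

# confirm the heartbeat interval on each node (path is an assumption)
grep 'nifi.cluster.protocol.heartbeat.interval' /opt/nifi/conf/nifi.properties
# expected on a default install:
# nifi.cluster.protocol.heartbeat.interval=5 sec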

This cluster worked fine yesterday and nothing changed on the system. Can anyone give me some insight into why the heartbeats are not created at the configured interval?

Thank you very much in advance!

Xi Sanderson

6 REPLIES

Super Mentor
@Xi Sanderson

The most common reason for not seeing heartbeats generated at the configured interval is a lot of Java full garbage collection (GC) going on. Full garbage collection is a stop-the-world event: the JVM will do nothing else until the collection has completed. Partial (young generation) garbage collection is normal and healthy, and its stop-the-world pauses should be extremely short in duration.

You can look at the GC stats for your NiFi cluster via the System Diagnostics UI. You can get to this UI via the link in the lower-right corner of the NiFi Summary UI:

[Screenshot: System Diagnostics link in the lower-right corner of the NiFi Summary UI]

Young GC events are normal to see, but ideally you want to see 0 Old Gen (full GC) events.
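
If you prefer the command line, the same stats are also exposed over the REST API; a hedged sketch (host and web port are placeholders, and a secured cluster would also need authentication):

# fetch GC and heap stats from a node's REST API
curl 'http://<node-host>:<web-port>/nifi-api/system-diagnostics'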

If GC is your issue, you will want to figure out what has changed: new data types or volumes, new dataflows, etc.
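
Another option, if the UI stays unreachable, is to turn on GC logging in conf/bootstrap.conf and restart; a sketch for Java 8 (the java.arg numbers and the log path are assumptions - pick numbers that do not collide with entries already in your file):

# add to conf/bootstrap.conf on each node, then restart NiFi
java.arg.20=-XX:+PrintGCDetails
java.arg.21=-XX:+PrintGCDateStamps
java.arg.22=-Xloggc:/var/log/nifi/gc.log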

Thanks,

Matt

Contributor

Hi Matt,

Thank you so much for getting back to me so quickly. I cannot access the NiFi UI because the nodes connect, then disconnect from the cluster so quickly. I do see the following in the NiFi node log about every minute:

2017-05-30 11:01:16,777 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@5af47414 checkpointed with 4 Records and 0 Swap Files in 10 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 3 millis), max Transaction ID 114

Is this normal, or an indication of a problem?

The cluster was fine yesterday when I checked, and nothing has changed - I am the only person who can make changes, so I know for sure.

Thanks again!

Xi

Super Mentor

@Xi Sanderson

The above log line is not an indication of a problem.

You can use the jstat command to look at your GC stats:

For example, the following command will print GC stats every 250 ms, for 1000 samples:

/<path to Java>/bin/jstat -gcutil <nifi pid> 250 1000
Column  Description
S0      Survivor space 0 utilization as a percentage of the space's current capacity.
S1      Survivor space 1 utilization as a percentage of the space's current capacity.
E       Eden space utilization as a percentage of the space's current capacity.
O       Old space utilization as a percentage of the space's current capacity.
P       Permanent space utilization as a percentage of the space's current capacity.
YGC     Number of young generation GC events.
YGCT    Young generation garbage collection time.
FGC     Number of full GC events.
FGCT    Full garbage collection time.
GCT     Total garbage collection time.

(On Java 8 the P column is replaced by M, Metaspace utilization, and CCS, compressed class space utilization.)
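
To fill in <nifi pid>, one quick way to find it (the run-directory path is an assumption; adjust for your install):

# read the pid file NiFi's bootstrap writes, or fall back to pgrep
cat /opt/nifi/run/nifi.pid
pgrep -f org.apache.nifi.NiFi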

What version of NiFi and what version of Java are you running?

Thanks,

Matt

Contributor

Hi Matt,

We are using jdk1.8.0_31 and nifi.version=1.0.0.2.0.1.0-12. The following are the first few lines of the jstat output:

[root@be-bi-nifi-441 conf]# /usr/java/default/bin/jstat -gcutil 3248 250 1000
  S0     S1      E      O      M     CCS    YGC     YGCT   FGC   FGCT      GCT
  0.00 100.00  36.84  33.25  95.26  89.80  5545  1731.464    4   3.563  1735.027
  0.00 100.00  77.89  33.25  95.26  89.80  5545  1731.464    4   3.563  1735.027
  0.00 100.00  93.68  33.25  95.26  89.80  5546  1731.464    4   3.563  1735.027
  0.00 100.00  93.68  33.25  95.26  89.80  5546  1731.464    4   3.563  1735.027
  0.00 100.00  26.32  33.90  95.26  89.80  5546  1731.930    4   3.563  1735.492
  0.00 100.00  64.21  33.90  95.26  89.80  5546  1731.930    4   3.563  1735.492
  0.00 100.00  93.68  33.90  95.26  89.80  5547  1731.930    4   3.563  1735.492
  0.00 100.00  93.68  33.90  95.26  89.80  5547  1731.930    4   3.563  1735.492

It looks like NiFi is busy with GC, just as you suspected, but I do not understand why. Can you please give me some advice on how to debug this without UI access?

Thank you very much!

Xi

Super Mentor

Your FGC is not very high at only 4.

Do you still have issues if you restart your cluster with all components in a stopped state?

If the disconnects stop, that would at least show that your issue is dataflow related.

To start with all components stopped, edit the nifi.properties file on all nodes and set the following property to "false":

nifi.flowcontroller.autoResumeState=false
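
If you manage both nodes over SSH, a quick hedged way to flip it (the install path is an assumption; a .bak backup of the file is kept):

# run on each node; review the change before restarting
sed -i.bak 's/^nifi.flowcontroller.autoResumeState=.*/nifi.flowcontroller.autoResumeState=false/' /opt/nifi/conf/nifi.properties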

Once your cluster is back up and you can access the UI, check for anywhere in your stopped flow where you have queued data (look at the largest queues first).

Start these flows first and look at the impact they have on GC and heap memory usage.

Another possibility is that you are oversubscribing your available resources.

Use the "top" command to observer impact on your system's resources (Namely CPU).
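
For example, to watch per-thread CPU for the NiFi JVM only (the pid file path is an assumption; adjust for your install):

# -H shows individual threads, -p limits top to the NiFi process
top -H -p "$(cat /opt/nifi/run/nifi.pid)"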

Make sure you have not allocated too many concurrent tasks to your individual processors.

Make sure you have not set your "max timer driven thread count" and "max event driven thread count" values too high. These thread pools exist per node, so the values should be set to 2-4 times the number of cores in one node (for example, if each node in your cluster has 16 cores, the max values should be between 32 and 64, i.e. 2-4 times 16 cores).

Thanks,

Matt

Contributor

Hi Matt,

I tried exactly what you suggested while I was waiting for your reply. I was able to access the UI with the dataflow stopped. I looked at the System Diagnostics you mentioned earlier: the full GC count was 4 without the dataflow running and is still only 8 now, after I started the dataflows.

Our cluster has 2 nodes and each has 16 cores. The "max timer driven thread count" is set to 64 and the "max event driven thread count" is set to 12. I've been monitoring "top"; CPU usage at this time (busy hour) is about 700%.

The good news is that after I started NiFi with all components stopped and manually restarted the dataflows, the problem I had this morning has not recurred. Heartbeats are generated at reasonable intervals - about every 7 seconds. What happened this morning is still a mystery to me, but I am happy it's working now.

Thank you so much for all the help!!!

Xi