Created 05-30-2017 02:46 PM
Hi,
We have a 2-node NiFi production cluster on the HDF-2.0.1.0 release. It has worked fine for over a year. This morning both nodes repeatedly connect to the cluster and are then disconnected due to lack of heartbeat. nifi.cluster.protocol.heartbeat.interval in nifi.properties is the default 5 sec. In the nifi node log I do not see heartbeats created every 5 seconds - in my working dev cluster they are created roughly every 5 seconds. In this production cluster the heartbeats are created only every 1 or 2 minutes:
2017-05-30 10:30:07,838 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:30:07,653 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:30:07,838; send took 184 millis
2017-05-30 10:31:14,986 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:31:14,515 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:31:14,986; send took 471 millis
2017-05-30 10:33:44,971 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:33:44,404 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:33:44,971; send took 566 millis
2017-05-30 10:34:15,280 INFO [Clustering Tasks Thread-3] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:34:15,122 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:34:15,280; send took 157 millis
2017-05-30 10:36:21,204 INFO [Clustering Tasks Thread-3] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-30 10:36:20,673 and sent to be-bi-nifi-441.soleocommunications.com:8085 at 2017-05-30 10:36:21,204; send took 530 millis
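For reference, this interval is controlled by a single entry in nifi.properties; a minimal illustration at its default value (not our full config):

```
nifi.cluster.protocol.heartbeat.interval=5 sec
```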
This cluster worked fine yesterday and nothing changed on the system. Can anyone give me some insight into why the heartbeats are not being created at the configured interval?
Thank you very much in advance!
Xi Sanderson
Created on 05-30-2017 02:57 PM - edited 08-18-2019 12:18 AM
The most common reason for not seeing heartbeats generated on the configured interval is a lot of Java full garbage collection (GC) going on. Full garbage collection is a stop-the-world event: the JVM does nothing else until the collection has completed. Partial (young generation) garbage collection is normal and healthy, and its stop-the-world pauses should be extremely short.
You can look at the GC stats for your NiFi cluster via the system diagnostics UI, which is reachable via a link in the lower right corner of the NiFi Summary UI.
Young generation GC events are normal to see, but you ideally want to see 0 Old Gen (full GC) events.
If GC is your issue, you will want to figure out what has changed: new data types or volumes, new dataflows, etc.
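If you cannot reach the UI, another option is to enable GC logging at the JVM level. A minimal sketch, assuming JDK 8 (these flags were removed in JDK 9+) and that the arg numbers and log path below do not collide with entries already in your conf/bootstrap.conf:

```
# Add to conf/bootstrap.conf on each node, then restart NiFi
java.arg.20=-XX:+PrintGCDetails
java.arg.21=-XX:+PrintGCDateStamps
java.arg.22=-Xloggc:/var/log/nifi/gc.log
```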
Thanks,
Matt
Created 05-30-2017 03:15 PM
Hi Matt,
Thank you so much for getting back to me so quickly. I cannot access the NiFi UI because the nodes connect and then disconnect from the cluster so quickly. I do see the following in the nifi node log about every minute:
2017-05-30 11:01:16,777 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@5af47414 checkpointed with 4 Records and 0 Swap Files in 10 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 3 millis), max Transaction ID 114
Is this normal or indication of a problem?
The cluster was fine yesterday when I checked, and nothing has changed - I am the only person who can make changes, so I know for sure.
Thanks again!
Xi
Created 05-30-2017 03:34 PM
The above log line is not an indication of a problem.
You can use the jstat command to look at your GC stats:
For example, the following command will print GC stats every 250 ms for 1000 samples:
/<path to Java>/bin/jstat -gcutil <nifi pid> 250 1000
Column | Description |
---|---|
S0 | Survivor space 0 utilization as a percentage of the space's current capacity. |
S1 | Survivor space 1 utilization as a percentage of the space's current capacity. |
E | Eden space utilization as a percentage of the space's current capacity. |
O | Old space utilization as a percentage of the space's current capacity. |
M | Metaspace utilization as a percentage of the space's current capacity (on JDK 8 this replaces the old permanent generation column). |
CCS | Compressed class space utilization as a percentage of the space's current capacity. |
YGC | Number of young generation GC events. |
YGCT | Young generation garbage collection time. |
FGC | Number of full GC events. |
FGCT | Full garbage collection time. |
GCT | Total garbage collection time. |
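As a small convenience on top of that, the sketch below looks up the NiFi JVM and prints a sample only when the full GC count changes, so a climbing FGC/FGCT stands out. The process pattern and Java path are assumptions about a typical install; adjust them to yours:

```
#!/bin/sh
# Find the NiFi JVM (main class org.apache.nifi.NiFi); take the first match.
NIFI_PID=$(pgrep -f org.apache.nifi.NiFi | head -1)
# Sample every second, indefinitely; print the header plus any row where
# FGC (9th column of -gcutil output) changed since the last printed row.
/usr/java/default/bin/jstat -gcutil "$NIFI_PID" 1000 |
  awk 'NR==1 || $9 != prev {print; prev = $9}'
```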
What version of NiFi and what version of Java are you running?
Thanks,
Matt
Created 05-30-2017 04:13 PM
Hi Matt,
We are using jdk1.8.0_31 and nifi.version=1.0.0.2.0.1.0-12. The following are the first few lines of the jstat output:
[root@be-bi-nifi-441 conf]# /usr/java/default/bin/jstat -gcutil 3248 250 1000
S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
0.00 100.00 36.84 33.25 95.26 89.80 5545 1731.464 4 3.563 1735.027
0.00 100.00 77.89 33.25 95.26 89.80 5545 1731.464 4 3.563 1735.027
0.00 100.00 93.68 33.25 95.26 89.80 5546 1731.464 4 3.563 1735.027
0.00 100.00 93.68 33.25 95.26 89.80 5546 1731.464 4 3.563 1735.027
0.00 100.00 26.32 33.90 95.26 89.80 5546 1731.930 4 3.563 1735.492
0.00 100.00 64.21 33.90 95.26 89.80 5546 1731.930 4 3.563 1735.492
0.00 100.00 93.68 33.90 95.26 89.80 5547 1731.930 4 3.563 1735.492
0.00 100.00 93.68 33.90 95.26 89.80 5547 1731.930 4 3.563 1735.492
Looks like NiFi is busy with GC just as you suspected, but I do not understand why. Can you please give me some advice on how to debug this without UI access?
Thank you very much!
Xi
Created 05-30-2017 05:21 PM
Your FGC is not very high at only 4.
Do you still have issues if you restart your cluster with all components in a stopped state?
This would at least show whether your issue is dataflow related.
To start with all components stopped, edit the nifi.properties file on all nodes and set the following property to "false":
nifi.flowcontroller.autoResumeState=false
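For example, on each node (the HDF conf path here is an assumption; point it at wherever your nifi.properties actually lives):

```
sed -i 's/^nifi.flowcontroller.autoResumeState=.*/nifi.flowcontroller.autoResumeState=false/' \
  /usr/hdf/current/nifi/conf/nifi.properties
```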
Once your cluster is back up and you can access the UI, check everywhere in your stopped flow where you have queued data (look at the largest queues first).
Start these flows first and look at the impact they are having on GC and heap memory usage, for example with the sketch below.
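A quick way to spot-check heap occupancy from the shell while you do that (a sketch assuming JDK 8 tooling; PID 3248 is the one from your jstat run):

```
# Prints heap configuration and current generation usage for the NiFi JVM
/usr/java/default/bin/jmap -heap 3248
```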
Another possibility is that you are oversubscribing your available resources.
Use the "top" command to observer impact on your system's resources (Namely CPU).
Make sure you have not allocated too many concurrent tasks to your individual processors.
Make sure you have not set your "max timer driven thread count" and "max event driven thread count" values too high. These values should be set to 2-4 times the number of cores in one node (for example, if each node in your cluster has 16 cores, the max values should be set between 32 and 64).
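If the cluster stays up, you can also read the current values without clicking through the UI; a sketch assuming an unsecured node (substitute your host and web HTTP port):

```
# NiFi 1.x REST API; the response includes maxTimerDrivenThreadCount
# and maxEventDrivenThreadCount
curl -s http://<node-host>:<web-port>/nifi-api/controller/config
```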
Thanks,
Matt
Created 05-30-2017 06:45 PM
Hi Matt,
I tried exactly what you suggested while I was waiting for your reply. I was able to access the UI without the data flow running. I looked at the System Diagnostics you mentioned earlier: it showed "4 times" without the data flow running and still shows "8 times" now after I started the data flows.
Our cluster has 2 nodes and each has 16 cores. The "max timer driven thread count" is set to 64 and the "max event driven thread count" is set to 12. I've been monitoring "top"; CPU usage at this time (busy hour) is about 700%.
The good news is that after I started NiFi in the stopped state and manually restarted the data flows, the problem I had this morning has not recurred. Heartbeats are generated at reasonable intervals - about every 7 seconds. What happened this morning is still a mystery to me, but I am happy it's working now.
Thank you so much for all the help!!!
Xi
Created 05-03-2024 03:37 AM
@MattWho , would the GC stats give information for the entire cluster or only for that single node?
Created 05-03-2024 07:24 AM
@manishg I recommend starting a new community question with your detailed query. This community question is about 7 years old and concerns a similarly old version of NiFi. A lot has evolved and changed since then.
Make sure your query includes details about your NiFi version.
Thank you,
Matt