Created 10-28-2022 12:32 AM
we have old hadoop cluster based on HDP from hortonworks version HDP 2.6.4
cluster include 2 namenode services when one is the standby namenode and the second is the active namenode , all machines in the cluster are rhel 7.2 version , and we not see any problem on OS level
also cluster include 12 workers machines ( worker include the datanode and node manager services )
the story begin when we get alerts from the smoke test script that complain about "`Detected pause in JVM or host machine`" on the standby namenode ,
so based on that we decided to increase the namenode heap size from 60G to 100G
and above setting was based on table that show how much memory to set according to number of files in HDFS
and according to the table we decided to set the namenode heapsize to 100G and then we restart the HDFS service
after HDFS is completely restarted , we still see the messages about `Detected pause in JVM or host machine` , and this is really strange because we almost twice the namenode heap size
so we start to perform deeply testing as by `jstat`
for example we get from jsat low very of FGCT that is really good values and not point on namenode heap problem ( 1837 is the HDFS PID number )
# /usr/jdk64/jdk1.8.0_112/bin/jstat -gcutil 1837 10 10
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
and here is the messages from namenode logs
2022-10-27 14:04:49,728 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2044ms
2022-10-27 16:21:33,973 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2524ms
2022-10-27 17:31:35,333 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2444ms
2022-10-27 18:55:55,387 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2134ms
2022-10-27 19:42:00,816 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2153ms
2022-10-27 20:50:23,624 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2050ms
2022-10-27 21:07:01,240 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2343ms
2022-10-27 23:53:00,507 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2120ms
2022-10-28 00:43:30,633 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1811ms
2022-10-28 00:53:35,120 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2192ms
2022-10-28 02:07:39,660 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2353ms
2022-10-28 02:49:25,018 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1698ms
2022-10-28 03:00:20,592 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2432ms
2022-10-28 05:02:15,093 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2016ms
2022-10-28 06:52:46,672 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approxim
as we can see each 1-2 hours message about pause in `JVM or host machine` is appears
we checked the number of files in HDFS and number of file is 7 million files
so what else we can do , we can increase a little bit the namenode heap size but my feeling is that heap size is really enough
Created 10-28-2022 01:33 AM
Hi @mike_bronson7 , The thumb rule is to have 1 Gb of heap allocated to 1 Million block. Now, as you have already doubled the heap, I think you can check on improving the garbage collector tunings.
You can try tunings the GC threads to see if that helps. Been said that, these pauses are not very huge to cause any real trouble and we don't see it repeating very frequently ( within mins) and can be expected on a busy Cluster.
You can also check if the threshold to report a JVM pause alert can be adjusted to say 5-10 secs which needs some intervention.
Created 11-17-2022 01:19 AM
Created 10-28-2022 01:33 AM
Hi @mike_bronson7 , The thumb rule is to have 1 Gb of heap allocated to 1 Million block. Now, as you have already doubled the heap, I think you can check on improving the garbage collector tunings.
You can try tunings the GC threads to see if that helps. Been said that, these pauses are not very huge to cause any real trouble and we don't see it repeating very frequently ( within mins) and can be expected on a busy Cluster.
You can also check if the threshold to report a JVM pause alert can be adjusted to say 5-10 secs which needs some intervention.
Created 10-28-2022 01:50 AM
since we have the Ambari do you mean that we need to find the GC settings in yarn-env template ?
Created 10-28-2022 03:07 AM
Those would go into hadoop-env inside HDFS.
Created 10-30-2022 05:49 AM
so based on doc seems that we need to increase the CMSInitiatingOccupancyFraction from default 70% to higher value as for example 85%
do you agree with that ?
Created 10-31-2022 05:28 AM
Nope. You will need to increase the GC threads. CMSInitiatingOccupancyFraction decides when to perform a Full GC and default value afaik in HDP is at 92%. Further, the reason for 2 sec pauses are very hard to identify.
Created 12-23-2022 11:10 AM
I have the same problem, but in my case JVM pause detections are happening every 15 mins and detecting between 28337ms, .. 853466ms pauses.
Created 11-17-2022 01:19 AM