Support Questions

Find answers, ask questions, and share your expertise

JVM: still seeing "Detected pause in JVM" despite increased heap size

We have an old Hadoop cluster based on Hortonworks HDP 2.6.4. The cluster includes two NameNode services, one active and one standby. All machines in the cluster run RHEL 7.2, and we do not see any problem at the OS level.

 

The cluster also includes 12 worker machines (each worker runs the DataNode and NodeManager services).

 

The story began when we got alerts from the smoke-test script complaining about "`Detected pause in JVM or host machine`" on the standby NameNode,

 

so based on that we decided to increase the NameNode heap size from 60 GB to 100 GB.

 

This setting was based on a table showing how much memory to allocate according to the number of files in HDFS.

 

According to that table we set the NameNode heap size to 100 GB and then restarted the HDFS service.

 

After HDFS completely restarted, we still see the `Detected pause in JVM or host machine` messages, which is really strange because we almost doubled the NameNode heap size.

 

So we started deeper testing, for example with `jstat`.

 

For example, `jstat` reports very low FGCT values, which look really good and do not point to a NameNode heap problem (1837 is the NameNode PID):

 

 

# /usr/jdk64/jdk1.8.0_112/bin/jstat -gcutil 1837 10 10
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
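As a quick sanity check on the jstat output above, here is a minimal sketch that works out the average full-GC pause from the FGC and FGCT columns of the sample:

```shell
# FGC (full GC count) and FGCT (total full GC time, in seconds) are taken
# straight from the jstat -gcutil sample above.
FGC=2
FGCT=1.798

# Average pause per full GC -- well under a second, so full GCs alone
# do not account for the ~2 s pauses that JvmPauseMonitor reports.
avg=$(awk -v t="$FGCT" -v n="$FGC" 'BEGIN { printf "%.3f", t / n }')
echo "average full GC pause: ${avg}s"   # prints: average full GC pause: 0.899s
```

This supports the point in the question: the pauses are not explained by full GC activity alone.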

 

And here are the messages from the NameNode logs:

 

 

2022-10-27 14:04:49,728 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2044ms
2022-10-27 16:21:33,973 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2524ms
2022-10-27 17:31:35,333 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2444ms
2022-10-27 18:55:55,387 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2134ms
2022-10-27 19:42:00,816 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2153ms
2022-10-27 20:50:23,624 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2050ms
2022-10-27 21:07:01,240 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2343ms
2022-10-27 23:53:00,507 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2120ms
2022-10-28 00:43:30,633 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1811ms
2022-10-28 00:53:35,120 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2192ms
2022-10-28 02:07:39,660 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2353ms
2022-10-28 02:49:25,018 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1698ms
2022-10-28 03:00:20,592 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2432ms
2022-10-28 05:02:15,093 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2016ms
2022-10-28 06:52:46,672 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approxim


As we can see, a pause in `JVM or host machine` message appears every 1-2 hours.

 

We checked the number of files in HDFS, and it is about 7 million files.

 

So what else can we do? We could increase the NameNode heap size a little more, but my feeling is that the heap size is already more than enough.

 

Michael-Bronson
1 ACCEPTED SOLUTION

Expert Contributor

@mike_bronson7 

 

--
Was your question answered? Please take some time to click on “Accept as Solution” below this post.

 


6 REPLIES

Expert Contributor

Hi @mike_bronson7, the rule of thumb is to allocate 1 GB of heap per 1 million blocks. Now, as you have already almost doubled the heap, I think you can look into tuning the garbage collector.
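To make that rule of thumb concrete, a hedged back-of-the-envelope check against the numbers from the question (assuming roughly one block per file, which is typical for small files):

```shell
FILES=7000000          # ~7 million files, as reported in the question
BLOCKS_PER_FILE=1      # assumption: roughly one block per file
GB_PER_MILLION=1       # rule of thumb: 1 GB of heap per 1 million blocks

needed_gb=$(( FILES * BLOCKS_PER_FILE / 1000000 * GB_PER_MILLION ))
echo "estimated heap needed: ~${needed_gb} GB (configured: 100 GB)"
```

By this estimate the configured 100 GB is far above what the metadata requires, so heap size is unlikely to be the bottleneck.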

 

https://community.cloudera.com/t5/Community-Articles/NameNode-Garbage-Collection-Configuration-Best-...

 

You can try tuning the GC threads to see if that helps. That said, these pauses are not large enough to cause any real trouble, they do not repeat very frequently (i.e. within minutes), and they can be expected on a busy cluster.
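A minimal sketch of what GC thread tuning could look like in `hadoop-env.sh` (or the hadoop-env template in Ambari). The values below are illustrative assumptions, not recommendations, and should be sized against the host's CPU count:

```shell
# Appended to HADOOP_NAMENODE_OPTS in hadoop-env.sh.
# ParallelGCThreads controls stop-the-world GC worker threads;
# ConcGCThreads controls concurrent GC threads. Both counts are
# illustrative examples only.
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
  -XX:ParallelGCThreads=16 \
  -XX:ConcGCThreads=4"
```

After changing the template in Ambari, the NameNode must be restarted for the new options to take effect.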

 

You can also check whether the threshold for reporting a JVM pause alert can be raised to, say, 5-10 seconds, since pauses of that length are the ones that actually need intervention.
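For the reporting threshold, Hadoop's `JvmPauseMonitor` reads its INFO and WARN thresholds from configuration. A sketch of raising the INFO threshold in `core-site.xml` so that ~2 s pauses stop being logged (the 5000 ms value is an example, not a recommendation):

```xml
<!-- jvm.pause.info-threshold.ms defaults to 1000 ms; pauses shorter than
     this are not logged at INFO level. The WARN level has its own key,
     jvm.pause.warn-threshold.ms (default 10000 ms). -->
<property>
  <name>jvm.pause.info-threshold.ms</name>
  <value>5000</value>
</property>
```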

Since we have Ambari, do you mean that we need to find the GC settings in the yarn-env template?

Michael-Bronson

Expert Contributor

So based on the doc, it seems we need to increase CMSInitiatingOccupancyFraction from the default of 70% to a higher value, for example 85%.

 

Do you agree with that?

Michael-Bronson

Expert Contributor

Nope. You will need to increase the GC threads. CMSInitiatingOccupancyFraction only decides when to perform a Full GC, and the default value in HDP is, AFAIK, 92%. Besides, the cause of 2-second pauses is very hard to identify.


 