Nodemanager process crashed due to 'GC overhead limit exceeded'

Expert Contributor

Problem Statement:

A few NodeManagers in the cluster are shutting down after crashing with the error below:

2018-08-24 09:37:31,583 INFO  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data07/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data08/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data10/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data09/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:33,138 FATAL yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[Container Monitor,5,main] threw an Error.  Shutting down now
...
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.io.BufferedReader.<init>(BufferedReader.java:105)
        at java.io.BufferedReader.<init>(BufferedReader.java:116)
        at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:554)
        at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:225)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:445)
2018-08-24 09:37:33,145 INFO  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(542)) - Deleting path : /data01/hadoop/yarn/log/application_1533656250055_31336/container_e92_1533656250055_31336_01_000001/directory.info

Ambari Version: 2.4.2.0

HDP Version: 2.5.3.0


Analysis: In the Ambari YARN configs I see that the NodeManager heap is set to 1 GB.

I have seen a few links which say that increasing the heap to 2 GB resolves the issue, e.g.:

http://www-01.ibm.com/support/docview.wss?uid=swg22002422

Suggestions/help expected:

1. Can you guide me on how to debug this GC error further for root-cause analysis? Do you think that by enabling GC logging and using the JConsole tool we can work out why and where the jobs are using more heap/memory?

2. How can we confirm that a 1 GB heap is the wrong size for this cluster before I proceed with increasing it to 2 GB?

3. Also, how can I make sure that after increasing to 2 GB I am not going to hit the GC issue again? Is there any forecasting I can do here to prevent the issue from happening in the future?

Please do let me know if you need any more details.

2 REPLIES

Super Collaborator

You can set the -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps flags in YARN_NODEMANAGER_OPTS and then view the NodeManager GC logs in a GC visualizer such as gceasy.io.
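In an Ambari-managed cluster this is usually done by appending the flags to YARN_NODEMANAGER_OPTS in the yarn-env template (Ambari > YARN > Configs > Advanced yarn-env). A sketch of what the addition could look like; the GC-log path and the rotation settings are illustrative, adjust them for your environment:

```shell
# Append GC-logging flags to the NodeManager JVM options (JDK 8 syntax).
# /var/log/hadoop-yarn/nodemanager-gc.log is an assumed path; change as needed.
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS \
  -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-yarn/nodemanager-gc.log \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=5 \
  -XX:GCLogFileSize=20M"
```

A NodeManager restart is required for the new options to take effect; the resulting log file can then be uploaded to gceasy.io for analysis.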

This error occurs when nearly all objects on the heap remain referenced/live, so the JVM spends almost all of its time (by default, more than 98%) in GC while each collection cycle reclaims less than 2% of the heap.

https://plumbr.io/outofmemoryerror/gc-overhead-limit-exceeded
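Besides GC logs, the live NodeManager heap can be inspected with the standard JDK tools on the affected host. A minimal sketch, assuming the JDK's jstat/jmap are on the PATH (the dump path is illustrative):

```shell
#!/bin/sh
# Locate the NodeManager JVM by its main class.
NM_PID=$(pgrep -f 'org.apache.hadoop.yarn.server.nodemanager.NodeManager' | head -n1)

if [ -n "$NM_PID" ]; then
  # Per-generation GC utilization: 3 samples, 5 seconds apart.
  jstat -gcutil "$NM_PID" 5000 3
  # Top 30 classes by live-object footprint -- which classes dominate the heap?
  jmap -histo:live "$NM_PID" | head -n 30
  # Full heap dump for offline analysis in Eclipse MAT or VisualVM.
  jmap -dump:live,format=b,file=/tmp/nm-heap.hprof "$NM_PID"
else
  echo "NodeManager is not running on this host"
fi
```

If the histogram is dominated by BufferedReader/char[] instances allocated from ProcfsBasedProcessTree (as in your stack trace), that points at the container monitor scanning a very large number of /proc entries with too little headroom in a 1 GB heap.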

Expert Contributor