Created 08-24-2018 09:49 AM
Problem Statement:
A few NodeManagers in the cluster are shutting down/crashing with the error below -
2018-08-24 09:37:31,583 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data07/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data08/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data10/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data09/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:33,138 FATAL yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now ...
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.io.BufferedReader.<init>(BufferedReader.java:105)
    at java.io.BufferedReader.<init>(BufferedReader.java:116)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:554)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:225)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:445)
2018-08-24 09:37:33,145 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(542)) - Deleting path : /data01/hadoop/yarn/log/application_1533656250055_31336/container_e92_1533656250055_31336_01_000001/directory.info
Ambari Version: 2.4.2.0
HDP Version: 2.5.3.0
Analysis: From the Ambari YARN configs I see that the NodeManager heap is set to 1 GB.
I found a few links that say increasing the heap to 2 GB resolves the issue, e.g. -
http://www-01.ibm.com/support/docview.wss?uid=swg22002422
Suggestions/help I am expecting:
1. Can you guide me on how to debug this GC error further for root-cause analysis (RCA)?
Do you think that by enabling GC logging and using the "JConsole" tool we can debug the jobs, i.e. find why and where they are using more heap/memory?
2. How can we confirm that a 1 GB heap is not the right size for this cluster before I proceed with increasing it to 2 GB? (One way I could imagine gathering such evidence is sketched after this list.)
3. Also, how can I make sure that after increasing to 2 GB I am not going to hit the GC issue again? Is there any forecasting I can do here to prevent the issue from recurring?
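As a minimal sketch of how such evidence could be gathered on a NodeManager host (assuming a JDK 8 HotSpot JVM; the pgrep pattern used to find the NodeManager PID is illustrative and may need adjusting to your install), GC time and old-generation occupancy can be sampled with jstat before deciding on a new heap size:

# Find the NodeManager JVM on this host (pattern is an assumption; adjust to your environment).
NM_PID=$(pgrep -f org.apache.hadoop.yarn.server.nodemanager.NodeManager | head -n1)

# Sample GC statistics every 10 seconds: O = old-gen occupancy %, FGC/FGCT = full GC count/time,
# GCT = total GC time. Sustained high O with steadily climbing FGC/GCT points at an undersized heap.
jstat -gcutil "$NM_PID" 10000

# Optional: a class histogram of live objects to see what is filling the heap
# (this forces a full GC, so run it sparingly on a busy node).
jmap -histo:live "$NM_PID" | head -n 30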
Created 08-24-2018 10:20 AM
You can set the -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps flags in YARN_NODEMANAGER_OPTS and then view the NodeManager GC logs in a GC visualizer such as gceasy.io.
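A minimal sketch of what that could look like in the yarn-env template (Ambari > YARN > Configs > Advanced yarn-env), assuming JDK 8; the -Xloggc path and the rotation flags are additions of mine beyond the flags named above, so treat them as illustrative:

# Append GC logging flags to the NodeManager JVM options (yarn-env.sh / Ambari yarn-env template).
# -Xloggc writes the GC log to a file so it can later be uploaded to a visualizer such as gceasy.io.
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-yarn/yarn-nodemanager-gc.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M"

After restarting the NodeManager, the resulting log shows allocation rates and whether full GC cycles actually reclaim space, which is the evidence you need before settling on a larger heap.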
This error is thrown when the JVM spends almost all of its time in garbage collection (by default, more than 98%) yet subsequent GC cycles reclaim less than 2% of the heap, which happens when nearly all objects on the heap are still referenced/live.
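For reference, those thresholds correspond to the HotSpot GCTimeLimit and GCHeapFreeLimit flags; a quick way to confirm the defaults on the JDK the NodeManager runs on (assuming a JDK 8 HotSpot JVM) is:

# Print the effective GC-overhead thresholds of the local JVM.
# GCTimeLimit (default 98) and GCHeapFreeLimit (default 2) define when
# "GC overhead limit exceeded" is thrown. Disabling the check with
# -XX:-UseGCOverheadLimit only hides the symptom; the heap would still be exhausted.
java -XX:+PrintFlagsFinal -version | grep -E 'GCTimeLimit|GCHeapFreeLimit|UseGCOverheadLimit'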
https://plumbr.io/outofmemoryerror/gc-overhead-limit-exceeded
Created 08-24-2018 01:22 PM