Created 08-24-2018 09:49 AM
Problem Statement:
A few NodeManagers in the cluster are shutting down/crashing with the error below -
2018-08-24 09:37:31,583 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data07/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data08/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data10/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:31,583 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(537)) - Deleting absolute path : /data09/hadoop/yarn/local/usercache/XXX/appcache/application_1533656250055_31336
2018-08-24 09:37:33,138 FATAL yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now ...
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.io.BufferedReader.<init>(BufferedReader.java:105)
    at java.io.BufferedReader.<init>(BufferedReader.java:116)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:554)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:225)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:445)
2018-08-24 09:37:33,145 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(542)) - Deleting path : /data01/hadoop/yarn/log/application_1533656250055_31336/container_e92_1533656250055_31336_01_000001/directory.info
Ambari Version: 2.4.2.0
HDP Version: 2.5.3.0
Analysis: From the Ambari YARN configs I see that the NodeManager heap is set to 1 GB.
I found a few links that say increasing the heap to 2 GB resolves the issue, e.g. -
http://www-01.ibm.com/support/docview.wss?uid=swg22002422
Suggestions/help I am expecting:
1. Can you guide me on how to debug this GC error further for root-cause analysis (RCA)?
Do you think that by enabling GC logging and using the "JConsole" tool we can debug the jobs, i.e. find why and where they are using more heap/memory?
2. How can we confirm that a 1 GB heap is not the right size for this cluster before I proceed with increasing it to 2 GB? (One way I could imagine gathering such evidence is sketched after this list.)
3. Also, how can I make sure that after increasing to 2 GB I am not going to hit the GC issue again? Is there any forecasting I can do here to prevent the issue from recurring?
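As a minimal sketch of how such evidence could be gathered on a NodeManager host (assuming a JDK 8 HotSpot JVM; the pgrep pattern used to find the NodeManager PID is illustrative and may need adjusting to your install), GC time and old-generation occupancy can be sampled with jstat before deciding on a new heap size:

# Find the NodeManager JVM on this host (pattern is an assumption; adjust to your environment).
NM_PID=$(pgrep -f org.apache.hadoop.yarn.server.nodemanager.NodeManager | head -n1)

# Sample GC statistics every 10 seconds: O = old-gen occupancy %, FGC/FGCT = full GC count/time,
# GCT = total GC time. Sustained high O with steadily climbing FGC/GCT points at an undersized heap.
jstat -gcutil "$NM_PID" 10000

# Optional: a class histogram of live objects to see what is filling the heap
# (this forces a full GC, so run it sparingly on a busy node).
jmap -histo:live "$NM_PID" | head -n 30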
Created 08-24-2018 10:20 AM
You can set the -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps flags in YARN_NODEMANAGER_OPTS and then view the NodeManager GC logs in a GC visualizer such as gceasy.io.
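A minimal sketch of what that could look like in the yarn-env template (Ambari > YARN > Configs > Advanced yarn-env), assuming JDK 8; the -Xloggc path and the rotation flags are additions of mine beyond the flags named above, so treat them as illustrative:

# Append GC logging flags to the NodeManager JVM options (yarn-env.sh / Ambari yarn-env template).
# -Xloggc writes the GC log to a file so it can later be uploaded to a visualizer such as gceasy.io.
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-yarn/yarn-nodemanager-gc.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M"

After restarting the NodeManager, the resulting log shows allocation rates and whether full GC cycles actually reclaim space, which is the evidence you need before settling on a larger heap.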
This error is thrown when the JVM spends almost all of its time in garbage collection (by default, more than 98%) yet subsequent GC cycles reclaim less than 2% of the heap, which happens when nearly all objects on the heap are still referenced/live.
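For reference, those thresholds correspond to the HotSpot GCTimeLimit and GCHeapFreeLimit flags; a quick way to confirm the defaults on the JDK the NodeManager runs on (assuming a JDK 8 HotSpot JVM) is:

# Print the effective GC-overhead thresholds of the local JVM.
# GCTimeLimit (default 98) and GCHeapFreeLimit (default 2) define when
# "GC overhead limit exceeded" is thrown. Disabling the check with
# -XX:-UseGCOverheadLimit only hides the symptom; the heap would still be exhausted.
java -XX:+PrintFlagsFinal -version | grep -E 'GCTimeLimit|GCHeapFreeLimit|UseGCOverheadLimit'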
https://plumbr.io/outofmemoryerror/gc-overhead-limit-exceeded
Created 08-24-2018 01:22 PM