Created on 07-18-2020 02:15 PM - edited 09-16-2022 07:38 AM
Hi All,
I am trying to add new nodes to my existing cluster (which was built by Cloudera Director); the new nodes are being added manually through Cloudera Manager.
When I add a new node, the NodeManager does not actually start, but the Cloudera Manager agent believes it started and is running. It neither connects to the ResourceManager nor writes any logs to the log directory (/var/log/hadoop-yarn/), which makes it difficult to find out what is going on.
NodeManager runs successfully on the nodes built by Cloudera Director, but not on the nodes added manually through Cloudera Manager.
I could only see `stderr.log`; no other logs were found. I compared this file with the one on a node that runs successfully, and it all looks fine.
Since the node is not producing any logs, I cannot provide anything from the /var/log/hadoop-yarn/*NODEMANAGER*log.out file.
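For reference, these are roughly the checks I ran while looking for logs on the affected node (the paths are the CM agent defaults that also show up in the agent log below):
ls -l /var/log/hadoop-yarn/                                      # expected a *NODEMANAGER*log.out here, but nothing is written
ls -l /run/cloudera-scm-agent/process/*yarn-NODEMANAGER*/logs/   # per-process logs kept by the CM agent
cat /run/cloudera-scm-agent/process/*yarn-NODEMANAGER*/logs/stderr.log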
The following are the entries from cloudera-scm-agent.log (which shows no errors, except that after some time it throws a connection refused on port 8042, since the NodeManager is not running):
[18/Jul/2020 16:57:40 +0000] 10596 MainThread heartbeat_tracker INFO HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.02 max:0.03 LIFE_MAX:0.06
[18/Jul/2020 17:05:03 +0000] 10596 CP Server Thread-6 _cplogging INFO 10.2.6.75 - - [18/Jul/2020:17:05:03] "GET /heartbeat HTTP/1.1" 200 2 "" "NING/1.0"
[18/Jul/2020 17:05:03 +0000] 10596 MainThread process INFO [135-yarn-NODEMANAGER] Updating process: False {u'running': (True, False), u'run_generation': (1, 2)}
[18/Jul/2020 17:05:03 +0000] 10596 MainThread process INFO Deactivating process 135-yarn-NODEMANAGER
[18/Jul/2020 17:05:04 +0000] 10596 MainThread agent INFO Triggering supervisord update.
[18/Jul/2020 17:05:05 +0000] 10596 CP Server Thread-7 _cplogging INFO 10.2.6.75 - - [18/Jul/2020:17:05:05] "GET /heartbeat HTTP/1.1" 200 2 "" "NING/1.0"
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO [135-yarn-NODEMANAGER] Updating process.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO Deleting process 135-yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO Triggering supervisord update.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO [135-yarn-NODEMANAGER] Orphaning process
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util INFO Using generic audit plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util INFO Creating metadata plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util INFO Using specific metadata plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util INFO Using generic metadata plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util INFO Creating profile plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util INFO Using generic profile plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO [136-yarn-NODEMANAGER] Instantiating process
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO [136-yarn-NODEMANAGER] Updating process: True {}
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO First time to activate the process [136-yarn-NODEMANAGER].
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO Created /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO Chowning /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER to yarn (983) hadoop (990)
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO Chmod'ing /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER to 0751
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO Created /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/logs
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO Chowning /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/logs to yarn (983) hadoop (990)
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO Chmod'ing /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/logs to 0751
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO [136-yarn-NODEMANAGER] Refreshing process files: None
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel INFO prepare_environment begin: {u'SPARK2': u'2.4.0.cloudera2-1.cdh5.13.3.p0.1041012', u'CDH': u'5.16.2-1.cdh5.16.2.p0.8'}, [u'cdh'], [u'hdfs-client-plugin', u'cdh-plugin', u'yarn-plugin', u'mapreduce2-plugin']
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel INFO The following requested parcels are not available: {}
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel INFO Obtained tags ['spark2'] for parcel SPARK2
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel INFO Obtained tags ['cdh', 'impala', 'kudu', 'sentry', 'solr', 'spark'] for parcel CDH
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel INFO prepare_environment end: {'CDH': '5.16.2-1.cdh5.16.2.p0.8'}
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__ INFO Extracted 18 files and 0 dirs to /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__ INFO Copied special file /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/topology.map to /etc/hadoop/conf.cloudera.CD-YARN-PbpthEaX/topology.map.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__ INFO Copied special file /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/topology.py to /etc/hadoop/conf.cloudera.CD-YARN-PbpthEaX/topology.py.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__ INFO Copied special file /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/container-executor.cfg to /var/lib/yarn-ce/etc/hadoop/container-executor.cfg.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__ INFO Copied special file /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/bin/container-executor to /var/lib/yarn-ce/bin/container-executor.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO [136-yarn-NODEMANAGER] Evaluating resource: [{u'path': u'/data0/yarn/nm', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 493}, {u'path': u'/data0/yarn/container-logs', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 493}, {u'path': u'/var/lib/hadoop-yarn/yarn-nm-recovery', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 493}, {u'path': u'/var/log/hadoop-yarn', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 509}, {u'path': u'/var/log/hadoop-yarn/stacks', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 493}]
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO reading limits: {u'limit_memlock': None, u'limit_fds': None}
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO [136-yarn-NODEMANAGER] Launching process. one-off False, command yarn/yarn.sh, args [u'nodemanager']
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent INFO Triggering supervisord update.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO Begin audit plugin refresh
[18/Jul/2020 17:05:05 +0000] 10596 MainThread navigator_plugin INFO Scheduling a refresh for Audit Plugin for yarn-NODEMANAGER with count 1 pipelines names [''].
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO Begin metadata plugin refresh
[18/Jul/2020 17:05:05 +0000] 10596 MainThread navigator_plugin INFO Scheduling a refresh for Metadata Plugin for yarn-NODEMANAGER with count 1 pipelines names [''].
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO Begin profile plugin refresh
[18/Jul/2020 17:05:05 +0000] 10596 MainThread navigator_plugin INFO Scheduling a refresh for Profile Plugin for yarn-NODEMANAGER with count 1 pipelines names ['TP'].
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__ INFO Instantiating generic monitor for service YARN and role NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO Begin monitor refresh.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread abstract_monitor INFO Refreshing GenericMonitor YARN-NODEMANAGER for None
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__ INFO New monitor: (<cmf.monitor.generic.GenericMonitor object at 0x7fa3452d1650>,)
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process INFO Daemon refresh complete for process 136-yarn-NODEMANAGER.
[18/Jul/2020 17:05:07 +0000] 10596 Audit-Plugin navigator_plugin INFO stopping Audit Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Audit-Plugin navigator_plugin INFO Piplines updated for Audit Plugin: []
[18/Jul/2020 17:05:07 +0000] 10596 Audit-Plugin navigator_plugin INFO Refreshing Audit Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Metadata-Plugin navigator_plugin INFO stopping Metadata Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Metadata-Plugin navigator_plugin INFO Piplines updated for Metadata Plugin: []
[18/Jul/2020 17:05:07 +0000] 10596 Metadata-Plugin navigator_plugin INFO Refreshing Metadata Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Profile-Plugin navigator_plugin INFO stopping Profile Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Profile-Plugin navigator_plugin INFO Piplines updated for Profile Plugin: []
[18/Jul/2020 17:05:07 +0000] 10596 Profile-Plugin navigator_plugin INFO Refreshing Profile Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:06:05 +0000] 10596 MonitorDaemon-Scheduler __init__ INFO Monitor expired: ('GenericMonitor YARN-NODEMANAGER for CD-YARN-PbpthEaX-NODEMANAGER-f2098a33ab399cc2ac267af283ff7430',)
Here is the error showing the connection refused, since the NodeManager is not running:
10596 Monitor-GenericMonitor throttling_logger ERROR (61 skipped) Error fetching metrics at 'http://us-east-1a-test-east-cdh-tasknode4241.throtle-test.internal:8042/jmx'
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/monitor/generic/metric_collectors.py", line 200, in _collect_and_pa
rse_and_return
self._adapter.safety_valve))
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/url_util.py", line 223, in urlopen_with_retry_on_authentication_err
ors
return function()
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/monitor/generic/metric_collectors.py", line 217, in _open_url
password=self._password_value)
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/url_util.py", line 70, in urlopen_with_timeout
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
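For what it's worth, the same endpoint the monitor polls can be probed by hand to confirm the state (host as in the error above):
curl -v http://us-east-1a-test-east-cdh-tasknode4241.throtle-test.internal:8042/jmx   # connection refused while the NM is down
ss -tlnp | grep 8042                                                                  # nothing listening on the NM web port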
Here is the log from the supervisord.log file:
2020-07-18 17:05:05,228 INFO spawned: '136-yarn-NODEMANAGER' with pid 48155
2020-07-18 17:05:26,039 INFO success: 136-yarn-NODEMANAGER entered RUNNING state, process has stayed up for > than 20 seconds (startsecs)
Can anyone help me fix this issue?
Thanks & Regards,
Nagamalleswara.
Created 07-20-2020 05:42 AM
A quick update:
Even though I was not able to find the root cause (or a fix for it), I tried rebooting the server running the NodeManager service, and now I can see that the NodeManager is up and running.
However, on every service restart I see the same issue of the NodeManager not running, and I have to reboot the server to get it up again.
Created 07-21-2020 08:33 AM
Hello @Nagamalleswara ,
thank you for sharing the detailed description of how your NM seems to have issues on a certain node.
Next time you hit this problem, can you please check whether you see multiple NM instances on a single node?
In some cases the CM agent can get into a bad state.
When you hit the issue, please run the following commands and see whether they show two NM processes running on the host.
SSH into the host and become root, then run:
# export SUPER_CONF=/var/run/cloudera-scm-agent/supervisor/supervisord.conf
# /usr/lib64/cmf/agent/build/env/bin/supervisorctl -c $SUPER_CONF status
52372-yarn-NODEMANAGER RUNNING pid 71771, uptime 2:21:19
52723-yarn-NODEMANAGER FATAL Exited too quickly (process
log may have details)
If so, it means that the CM agent process got into a bad state.
In this case, please (a consolidated sketch follows the list):
1) Kill the stale NM PID. Verify that the PID no longer exists and that nothing else is listening on the port.
2) Restart the agent: service cloudera-scm-agent restart
3) Start the NM role from CM.
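Put together, the recovery sequence would look roughly like this (a sketch; substitute the PID reported by supervisorctl status):
# kill -9 <PID>                        # the stale NM PID from the supervisorctl output
# ss -tlnp | grep 8042                 # confirm nothing is still listening on the NM web port
# service cloudera-scm-agent restart
Then start the NodeManager role from the CM UI.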
Please share your findings with us. Even if you are not hitting the issue described above, it still helps us narrow this down further.
Please also check that the NM heap size on this node is set to at least 1 GB, just to rule out stop-the-world GC activity rendering the NM unresponsive.
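A quick way to read the effective heap settings off the live process (a sketch relying on the standard -Xms/-Xmx JVM flags; the authoritative value is the NodeManager heap configuration in CM):
# ps -o args= -p $(pgrep -f proc_nodemanager) | tr ' ' '\n' | grep -E '^-Xm[sx]'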
Thank you:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Created 07-21-2020 12:45 PM
Thanks @Bender for the detailed instructions.
Yes, I am still facing this issue and have been trying to chase it down for the last few days.
I have tried the commands you provided; here is the output:
[root@us-east-1a-test-east-cdh-corenode4163 cloudera-scm-agent]# export SUPER_CONF=/var/run/cloudera-scm-agent/supervisor/supervisord.conf
[root@us-east-1a-test-east-cdh-corenode4163 cloudera-scm-agent]# /usr/lib64/cmf/agent/build/env/bin/supervisorctl -c $SUPER_CONF status
138-cluster-host-inspector EXITED Jul 21 03:00 PM
201-hdfs-DATANODE RUNNING pid 96936, uptime 0:11:58
209-hbase-REGIONSERVER RUNNING pid 97227, uptime 0:11:31
214-yarn-NODEMANAGER RUNNING pid 97307, uptime 0:11:30
cmflistener RUNNING pid 7412, uptime 4:53:45
flood RUNNING pid 7449, uptime 4:53:44
[root@us-east-1a-test-east-cdh-corenode4163 cloudera-scm-agent]#
So in my case I don't have the problem of two NM instances running on a single node. Here is the process output from the ps command:
yarn 97307 0.0 0.0 2858224 23704 ? Sl 15:22 0:00 /usr/lib/jvm/java-openjdk/bin/java -Dproc_nodemanager -Xmx1000m -Djava.net.preferIPv4Stack=true -server -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dlibrary.leveldbjni.path=/run/cloudera-scm-agent/process/214-yarn-NODEMANAGER -Dhadoop.event.appender=,EventCatcher -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/CD-YARN-LGzbJezU_CD-YARN-LGzbJezU-NODEMANAGER-f953f0c79fd5345f10fb347aa90e7500_pid97307.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop-cmf-CD-YARN-LGzbJezU-NODEMANAGER-us-east-1a-test-east-cdh-corenode4163.throtle-test.internal.log.out -Dyarn.log.file=hadoop-cmf-CD-YARN-LGzbJezU-NODEMANAGER-us-east-1a-test-east-cdh-corenode4163.throtle-test.internal.log.out -Dyarn.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/native -classpath /run/cloudera-scm-agent/process/214-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/214-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/214-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/.//*:/usr/share/cmf/lib/plugins/event-publish-5.16.2-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-5.16.2.jar:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/run/cloudera-scm-agent/process/214-yarn-NODEMANAGER/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager
Yes, the NodeManager is given 1 GB by default, and I even tried 2 GB and 4 GB; the issue still persists.
Please note that I am seeing this issue on random servers on every restart of the cluster/YARN service. (As mentioned earlier, the process will be running but no logs are written; the cloudera-scm-agent just complains that it's not able to connect to port 8042.)
Could you suggest any other way to debug this issue?
Note:
Not sure if it helps, but when we tested the cluster with small nodes (r4.xlarge) we did not see this issue. We started seeing it after we increased the node size to r4.8xlarge.
Created 07-23-2020 12:08 AM
Hello @Nagamalleswara ,
What is odd is that the issue happens only on large AWS nodes, and intermittently. Did you check whether the ports the NM needs are open bi-directionally on the affected nodes?
Can you access the NM Web UI on the affected node, please?
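For the port check, something like this from the affected NM host may help (a sketch; 8031 is the stock RM resource-tracker port the NM registers on, adjust if your cluster overrides it):
# nc -zv <resourcemanager-host> 8031   # NM -> RM registration and heartbeat path
# ss -tlnp | grep 8042                 # is the NM Web UI listening locally?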
Thank you:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Created 07-23-2020 04:10 AM
Hi @Bender
Yes, we checked the ports and verified that they're accessible, and yes, it's very strange to see an issue like this.
I was not able to get much out of the process (yarn NODEMANAGER), since it is running but does nothing, not even writing logs.
In some cases a reboot of the server works, and in some cases a service restart works. But in some cases even a server reboot doesn't help.
I have tried the following:
jcmd <pid> Thread.print >> /path/to/file
Please let me know if I can check anything else to resolve the issue.
Created on 07-23-2020 06:51 AM - edited 07-23-2020 06:52 AM
Hello @Nagamalleswara ,
please make sure you are using the yarn user when collecting the thread dumps.
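For example (a sketch; the pgrep pattern relies on the -Dproc_nodemanager flag visible in your ps output):
# sudo -u yarn jcmd $(pgrep -f proc_nodemanager) Thread.print > /tmp/nm_threads.$(date +%s).txt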
I've also found this description useful for figuring out what is going on inside the JVM.
Kind regards:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Created 07-23-2020 08:47 AM
Thanks for providing the link for JVM analysis; it's really helpful.
Yes, in my case the process was not responding, so I should have used the `kill -3` option.
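Something like this, I assume (with SIGQUIT the dump goes to the JVM's stdout, which as far as I can tell the CM agent captures in the per-process logs directory alongside stderr.log):
sudo -u yarn kill -3 <nm-pid>
less /run/cloudera-scm-agent/process/*yarn-NODEMANAGER*/logs/stdout.log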
I will try it next time and share the results.
Created 07-24-2020 10:14 AM
Hi @Bender
As described in the link, I tried to produce thread and heap dump files from the running process, but as I mentioned earlier, those processes were getting killed or throwing errors.
Here is the output I am getting when I run jmap as per the doc/link provided:
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ ps -fe | grep nodemanager
yarn 5235 12503 0 13:07 ? 00:00:00 /usr/lib/jvm/java-openjdk/bin/java -Dproc_nodemanager -Xmx1000m -Djava.net.preferIPv4Stack=true -server -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dlibrary.leveldbjni.path=/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER -Dhadoop.event.appender=,EventCatcher -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/CD-YARN-QafZaOEK_CD-YARN-QafZaOEK-NODEMANAGER-4756e03a64cd1a4e535550d4cd740b08_pid5235.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop-cmf-CD-YARN-QafZaOEK-NODEMANAGER-us-east-1a-test-east-cdh-tasknode5152.throtle-test.internal.log.out -Dyarn.log.file=hadoop-cmf-CD-YARN-QafZaOEK-NODEMANAGER-us-east-1a-test-east-cdh-tasknode5152.throtle-test.internal.log.out -Dyarn.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/native -classpath /run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/.//*:/usr/share/cmf/lib/plugins/event-publish-5.16.2-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-5.16.2.jar:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager
yarn 5240 5235 0 13:07 ? 00:00:00 python2.7 /usr/lib64/cmf/agent/build/env/bin/cmf-redactor /usr/lib64/cmf/service/yarn/yarn.sh nodemanager
yarn 5487 31141 0 13:07 pts/0 00:00:00 grep --color=auto nodemanager
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ /usr/lib/jvm/java-openjdk/bin/jmap -heap 5235 > /tmp/jmap_5235_heap.out
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jmap.JMap.runTool(JMap.java:201)
at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: java.lang.NullPointerException
at sun.jvm.hotspot.tools.HeapSummary.run(HeapSummary.java:157)
at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:50)
... 6 more
Here is the error message I am getting when I run jstack with the -l option:
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ ps -fe | grep nodemanager
yarn 4518 12503 0 13:04 ? 00:00:00 /usr/lib/jvm/java-openjdk/bin/java -Dproc_nodemanager -Xmx1000m -Djava.net.preferIPv4Stack=true -server -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dlibrary.leveldbjni.path=/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER -Dhadoop.event.appender=,EventCatcher -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/CD-YARN-QafZaOEK_CD-YARN-QafZaOEK-NODEMANAGER-4756e03a64cd1a4e535550d4cd740b08_pid4518.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop-cmf-CD-YARN-QafZaOEK-NODEMANAGER-us-east-1a-test-east-cdh-tasknode5152.throtle-test.internal.log.out -Dyarn.log.file=hadoop-cmf-CD-YARN-QafZaOEK-NODEMANAGER-us-east-1a-test-east-cdh-tasknode5152.throtle-test.internal.log.out -Dyarn.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/native -classpath /run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/.//*:/usr/share/cmf/lib/plugins/event-publish-5.16.2-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-5.16.2.jar:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager
yarn 4523 4518 0 13:04 ? 00:00:00 python2.7 /usr/lib64/cmf/agent/build/env/bin/cmf-redactor /usr/lib64/cmf/service/yarn/yarn.sh nodemanager
yarn 4717 31141 0 13:05 pts/0 00:00:00 grep --color=auto nodemanager
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ /usr/lib/jvm/java-openjdk/bin/jstack -F 4518 > /tmp/jstack_4518_f.out
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ less /tmp/jstack_4518_f.out
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ /usr/lib/jvm/java-openjdk/bin/jstack -l 4518 > /tmp/jstack_4518_f.out
4518: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
For process 4518 I ran jstack with the -F option, and here is the output:
Debugger attached successfully.
Server compiler detected.
JVM version is 25.181-b13
Deadlock Detection:
No deadlocks found.