Unable to start NodeManager in CDH 5.16.2

Contributor

Hi All,

 

I am trying to add new nodes to my existing cluster (which was built with Cloudera Director); the new nodes are being added manually through Cloudera Manager.

 

When I add a new node, the NodeManager does not actually start, but the Cloudera Manager agent believes it is up and running. It neither connects to the ResourceManager nor writes any logs to the log directory (/var/log/hadoop-yarn/), which makes it difficult to find out what is going on.

 

NodeManager runs successfully on the nodes that were built by Cloudera Director, but it does not run on the nodes added manually through Cloudera Manager.

 

I could only see `stderr.log`; no other logs were found. I compared this file against a node that is running successfully, and everything looks the same.
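
For reference, the stderr.log I found lives under the agent's per-process directory (the same /run/cloudera-scm-agent/process path that shows up in the agent log below); the numeric prefix changes on every restart, so these hypothetical commands pick the newest attempt:

# newest NODEMANAGER process directory created by the CM agent
ls -dt /run/cloudera-scm-agent/process/*-yarn-NODEMANAGER | head -1
# its captured stderr
cat /run/cloudera-scm-agent/process/<id>-yarn-NODEMANAGER/logs/stderr.log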

 

Since the node is not producing any logs, I cannot provide anything from the /var/log/hadoop-yarn/*NODEMANAGER*log.out file.

 

Below are the entries from cloudera-scm-agent.log. They don't show any error, except that after some time the agent throws a connection refused on port 8042, since the NodeManager is not running:

[18/Jul/2020 16:57:40 +0000] 10596 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.02 max:0.03 LIFE_MAX:0.06
[18/Jul/2020 17:05:03 +0000] 10596 CP Server Thread-6 _cplogging   INFO     10.2.6.75 - - [18/Jul/2020:17:05:03] "GET /heartbeat HTTP/1.1" 200 2 "" "NING/1.0"
[18/Jul/2020 17:05:03 +0000] 10596 MainThread process      INFO     [135-yarn-NODEMANAGER] Updating process: False {u'running': (True, False), u'run_generation': (1, 2)}
[18/Jul/2020 17:05:03 +0000] 10596 MainThread process      INFO     Deactivating process 135-yarn-NODEMANAGER
[18/Jul/2020 17:05:04 +0000] 10596 MainThread agent        INFO     Triggering supervisord update.
[18/Jul/2020 17:05:05 +0000] 10596 CP Server Thread-7 _cplogging   INFO     10.2.6.75 - - [18/Jul/2020:17:05:05] "GET /heartbeat HTTP/1.1" 200 2 "" "NING/1.0"
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     [135-yarn-NODEMANAGER] Updating process.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     Deleting process 135-yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     Triggering supervisord update.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     [135-yarn-NODEMANAGER] Orphaning process
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util         INFO     Using generic audit plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util         INFO     Creating metadata plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util         INFO     Using specific metadata plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util         INFO     Using generic metadata plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util         INFO     Creating profile plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread util         INFO     Using generic profile plugin for process yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     [136-yarn-NODEMANAGER] Instantiating process
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     [136-yarn-NODEMANAGER] Updating process: True {}
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     First time to activate the process [136-yarn-NODEMANAGER].
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     Created /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     Chowning /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER to yarn (983) hadoop (990)
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     Chmod'ing /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER to 0751
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     Created /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/logs
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     Chowning /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/logs to yarn (983) hadoop (990)
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     Chmod'ing /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/logs to 0751
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     [136-yarn-NODEMANAGER] Refreshing process files: None
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel       INFO     prepare_environment begin: {u'SPARK2': u'2.4.0.cloudera2-1.cdh5.13.3.p0.1041012', u'CDH': u'5.16.2-1.cdh5.16.2.p0.8'}, [u'cdh'], [u'hdfs-client-plugin', u'cdh-plugin', u'yarn-plugin', u'mapreduce2-plugin']
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel       INFO     The following requested parcels are not available: {}
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel       INFO     Obtained tags ['spark2'] for parcel SPARK2
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel       INFO     Obtained tags ['cdh', 'impala', 'kudu', 'sentry', 'solr', 'spark'] for parcel CDH
[18/Jul/2020 17:05:05 +0000] 10596 MainThread parcel       INFO     prepare_environment end: {'CDH': '5.16.2-1.cdh5.16.2.p0.8'}
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__     INFO     Extracted 18 files and 0 dirs to /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__     INFO     Copied special file /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/topology.map to /etc/hadoop/conf.cloudera.CD-YARN-PbpthEaX/topology.map.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__     INFO     Copied special file /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/topology.py to /etc/hadoop/conf.cloudera.CD-YARN-PbpthEaX/topology.py.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__     INFO     Copied special file /run/cloudera-scm-agent/process/136-yarn-NODEMANAGER/container-executor.cfg to /var/lib/yarn-ce/etc/hadoop/container-executor.cfg.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__     INFO     Copied special file /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/bin/container-executor to /var/lib/yarn-ce/bin/container-executor.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     [136-yarn-NODEMANAGER] Evaluating resource: [{u'path': u'/data0/yarn/nm', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 493}, {u'path': u'/data0/yarn/container-logs', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 493}, {u'path': u'/var/lib/hadoop-yarn/yarn-nm-recovery', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 493}, {u'path': u'/var/log/hadoop-yarn', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 509}, {u'path': u'/var/log/hadoop-yarn/stacks', u'bytes_free_warning_threshhold_bytes': 0, u'group': u'hadoop', u'user': u'yarn', u'mode': 493}]
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     reading limits: {u'limit_memlock': None, u'limit_fds': None}
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     [136-yarn-NODEMANAGER] Launching process. one-off False, command yarn/yarn.sh, args [u'nodemanager']
[18/Jul/2020 17:05:05 +0000] 10596 MainThread agent        INFO     Triggering supervisord update.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     Begin audit plugin refresh
[18/Jul/2020 17:05:05 +0000] 10596 MainThread navigator_plugin INFO     Scheduling a refresh for Audit Plugin for yarn-NODEMANAGER with count 1 pipelines names [''].
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     Begin metadata plugin refresh
[18/Jul/2020 17:05:05 +0000] 10596 MainThread navigator_plugin INFO     Scheduling a refresh for Metadata Plugin for yarn-NODEMANAGER with count 1 pipelines names [''].
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     Begin profile plugin refresh
[18/Jul/2020 17:05:05 +0000] 10596 MainThread navigator_plugin INFO     Scheduling a refresh for Profile Plugin for yarn-NODEMANAGER with count 1 pipelines names ['TP'].
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__     INFO     Instantiating generic monitor for service YARN and role NODEMANAGER
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     Begin monitor refresh.
[18/Jul/2020 17:05:05 +0000] 10596 MainThread abstract_monitor INFO     Refreshing GenericMonitor YARN-NODEMANAGER for None
[18/Jul/2020 17:05:05 +0000] 10596 MainThread __init__     INFO     New monitor: (<cmf.monitor.generic.GenericMonitor object at 0x7fa3452d1650>,)
[18/Jul/2020 17:05:05 +0000] 10596 MainThread process      INFO     Daemon refresh complete for process 136-yarn-NODEMANAGER.
[18/Jul/2020 17:05:07 +0000] 10596 Audit-Plugin navigator_plugin INFO     stopping Audit Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Audit-Plugin navigator_plugin INFO     Piplines updated for Audit Plugin: []
[18/Jul/2020 17:05:07 +0000] 10596 Audit-Plugin navigator_plugin INFO     Refreshing Audit Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Metadata-Plugin navigator_plugin INFO     stopping Metadata Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Metadata-Plugin navigator_plugin INFO     Piplines updated for Metadata Plugin: []
[18/Jul/2020 17:05:07 +0000] 10596 Metadata-Plugin navigator_plugin INFO     Refreshing Metadata Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Profile-Plugin navigator_plugin INFO     stopping Profile Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:05:07 +0000] 10596 Profile-Plugin navigator_plugin INFO     Piplines updated for Profile Plugin: []
[18/Jul/2020 17:05:07 +0000] 10596 Profile-Plugin navigator_plugin INFO     Refreshing Profile Plugin for yarn-NODEMANAGER with count 0 pipelines names [].
[18/Jul/2020 17:06:05 +0000] 10596 MonitorDaemon-Scheduler __init__     INFO     Monitor expired: ('GenericMonitor YARN-NODEMANAGER for CD-YARN-PbpthEaX-NODEMANAGER-f2098a33ab399cc2ac267af283ff7430',)

Here is the error showing the connection refused, since the NodeManager is not running:

10596 Monitor-GenericMonitor throttling_logger ERROR    (61 skipped) Error fetching metrics at 'http://us-east-1a-test-east-cdh-tasknode4241.throtle-test.internal:8042/jmx'
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/monitor/generic/metric_collectors.py", line 200, in _collect_and_parse_and_return
    self._adapter.safety_valve))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/url_util.py", line 223, in urlopen_with_retry_on_authentication_errors
    return function()
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/monitor/generic/metric_collectors.py", line 217, in _open_url
    password=self._password_value)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/url_util.py", line 70, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
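
As a sanity check (my own hypothetical commands, not something from the agent), I could confirm on the host that nothing is listening on the NM web port, which matches the traceback above:

curl -s http://localhost:8042/jmx > /dev/null && echo up || echo down
ss -lntp | grep ':8042'   # empty output means nothing is bound to the port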

Here is the log from the supervisord.log file:

2020-07-18 17:05:05,228 INFO spawned: '136-yarn-NODEMANAGER' with pid 48155
2020-07-18 17:05:26,039 INFO success: 136-yarn-NODEMANAGER entered RUNNING state, process has stayed up for > than 20 seconds (startsecs)
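
Note: the RUNNING state above only means the PID stayed alive for the 20-second startsecs window; it says nothing about whether the JVM is actually serving requests. A hypothetical cross-check, reading the PID back out of supervisorctl and probing the web port:

SUPER_CONF=/var/run/cloudera-scm-agent/supervisor/supervisord.conf
pid=$(/usr/lib64/cmf/agent/build/env/bin/supervisorctl -c $SUPER_CONF status | awk '/yarn-NODEMANAGER/ {print $4}' | tr -d ',')
kill -0 "$pid" && echo "process $pid alive"
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8042/node   # 000 means connection refused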

Can anyone help me fix this issue?

 

Thanks & Regards,
Nagamalleswara.


Contributor

A quick update,

 

Even though I was not able to find the root cause or a fix for it, I tried rebooting the server running the NodeManager service, and now the NodeManager is up and running.

 

But on every service restart I see the same issue of the NodeManager not running, and I have to reboot the server to get it up and running again.

 

Moderator

Hello @Nagamalleswara ,

 

Thank you for sharing with us the detailed description of how your NM seems to have issues on a certain node.

 

Next time you hit this problem, can you please check whether you see multiple NM instances on a single node?

In some cases the CM agent can get into a bad state.

 

When you hit the issue, SSH into the host, become root, and run the following commands to see whether they show two NM processes running on the host:

 

# export SUPER_CONF=/var/run/cloudera-scm-agent/supervisor/supervisord.conf

# /usr/lib64/cmf/agent/build/env/bin/supervisorctl -c $SUPER_CONF status

52372-yarn-NODEMANAGER       RUNNING    pid 71771, uptime 2:21:19
52723-yarn-NODEMANAGER       FATAL      Exited too quickly (process log may have details)

 

If so, it means that the CM agent has gotten into a bad state.

 

In this case please:

1) Kill the PID (see the sketch below). Verify that the PID no longer exists and that nothing else is listening on the port.

2) Restart cloudera-scm-agent:

 service cloudera-scm-agent restart

3) Start the NM process from CM.
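
A hedged sketch of these steps, assuming <pid> is the stale NodeManager PID reported by supervisorctl:

kill -9 <pid>
ps -p <pid>                  # should report nothing once the PID is gone
ss -lntp | grep ':8042'      # confirm nothing else is bound to the NM web port
service cloudera-scm-agent restart
# then start the NodeManager role from the Cloudera Manager UI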

 

Please share your findings with us. Even if you are not hitting the issue described above, it still helps us narrow this down further.

 

Please also check that the NM heap size on this node is set to at least 1 GB, just to rule out stop-the-world GC activity rendering the NM unresponsive.
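
A hypothetical way to verify the effective heap on the node is to read the -Xmx flag off the running JVM's command line:

ps -fe | grep '[N]odeManager' | grep -o -- '-Xmx[0-9a-zA-Z]*'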

 

Thank you:
Ferenc


Ferenc Erdelyi, Technical Solutions Manager


Contributor

Thanks @Bender for the detailed instructions.

 

Yes, I am still facing this issue and have been trying to chase it down for the last few days.

 

I have tried the commands you provided; here is the output:

 

[root@us-east-1a-test-east-cdh-corenode4163 cloudera-scm-agent]# export SUPER_CONF=/var/run/cloudera-scm-agent/supervisor/supervisord.conf
[root@us-east-1a-test-east-cdh-corenode4163 cloudera-scm-agent]# /usr/lib64/cmf/agent/build/env/bin/supervisorctl -c $SUPER_CONF status
138-cluster-host-inspector       EXITED     Jul 21 03:00 PM
201-hdfs-DATANODE                RUNNING    pid 96936, uptime 0:11:58
209-hbase-REGIONSERVER           RUNNING    pid 97227, uptime 0:11:31
214-yarn-NODEMANAGER             RUNNING    pid 97307, uptime 0:11:30
cmflistener                      RUNNING    pid 7412, uptime 4:53:45
flood                            RUNNING    pid 7449, uptime 4:53:44
[root@us-east-1a-test-east-cdh-corenode4163 cloudera-scm-agent]#

 

So in my case I don't have the problem of two NM instances running on a single node. Here is the process output from the ps command:

 

yarn      97307  0.0  0.0 2858224 23704 ?       Sl   15:22   0:00 /usr/lib/jvm/java-openjdk/bin/java -Dproc_nodemanager -Xmx1000m -Djava.net.preferIPv4Stack=true -server -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dlibrary.leveldbjni.path=/run/cloudera-scm-agent/process/214-yarn-NODEMANAGER -Dhadoop.event.appender=,EventCatcher -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/CD-YARN-LGzbJezU_CD-YARN-LGzbJezU-NODEMANAGER-f953f0c79fd5345f10fb347aa90e7500_pid97307.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop-cmf-CD-YARN-LGzbJezU-NODEMANAGER-us-east-1a-test-east-cdh-corenode4163.throtle-test.internal.log.out -Dyarn.log.file=hadoop-cmf-CD-YARN-LGzbJezU-NODEMANAGER-us-east-1a-test-east-cdh-corenode4163.throtle-test.internal.log.out -Dyarn.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/native -classpath /run/cloudera-scm-agent/process/214-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/214-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/214-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/.//*:/usr/share/cmf/lib/plugins/event-publish-5.16.2-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-5.16.2.jar:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/run/cloudera-scm-agent/process/214-yarn-NODEMANAGER/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager

 

Yes, the NodeManager is given 1 GB by default, and I even tried 2 GB and 4 GB, but the issue persists.

 

Please note that I am seeing this issue on random servers, on every restart of the cluster/YARN service. (As mentioned earlier, the process is running but no logs are written; the cloudera-scm-agent just complains that it cannot connect to port 8042.)

 

Could you suggest any other ways to debug this issue?

 

Note:

Not sure if it helps, but when we tested the cluster with smaller nodes (r4.xlarge) we did not see this issue. We started seeing it after we increased the node size to r4.8xlarge.

Moderator

Hello @Nagamalleswara ,

 

What is odd is that the issue happens only on large AWS nodes, and intermittently. Did you check whether the ports are open bi-directionally for the NMs on the affected nodes?

Can you access the NM WebUI on the affected node, please?
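
For example (hypothetical checks from another cluster host; substitute the affected hostname -- 8042 is the NM web UI port, and 8041 is the default NM RPC port in CDH):

curl -s http://<affected-host>:8042/node > /dev/null && echo 'web UI reachable'
nc -zv <affected-host> 8041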

 

Thank you:

Ferenc


Ferenc Erdelyi, Technical Solutions Manager


Contributor

Hi @Bender 

 

Yes, we checked the ports and verified that they're accessible. And yes, it's very strange to see an issue like this.

 

I was not able to do much, since the yarn NODEMANAGER process is running but does nothing, not even writing logs.

 

In some cases a reboot of the server works, and in some cases a service restart works. But in some cases even a server reboot doesn't work.

 

I have tried the following options:

 

  1. Tried changing the log level to DEBUG to see if that writes any logs -- but it didn't write anything.
  2. Since the Java process is running (the supervisor thinks the NM is up because the process stayed alive for > 20 sec), I thought of getting a thread dump to analyze, to see if I could learn anything.
  3. To get the thread dump I used jcmd, but running it kills the (silent) process and a new one spins up; the same thing happens again on the new process. (See the sketch after this list for a way to inspect the process without attaching.)

 

jcmd <pid> Thread.print >> /path/to/file

 

  • I also checked for deadlocks with jstack -F, and the result showed there were no deadlocks.
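
Since attaching kills the process, here is a hypothetical non-intrusive alternative that reads the kernel's view of the process from /proc instead (the stack file needs root):

pid=<nodemanager-pid>
grep -E 'State|Threads' /proc/$pid/status   # process state and thread count
cat /proc/$pid/stack                        # kernel-side stack of the main thread
ls /proc/$pid/fd | wc -l                    # count of open file descriptors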

Please let me know if there is anything else I can check to resolve the issue.

Moderator

Hello @Nagamalleswara ,

 

Please make sure you are using the yarn user when collecting the thread dumps.
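
For example (a hypothetical invocation; attaching as a user other than the process owner is a common cause of "Unable to open socket file" errors):

sudo -u yarn /usr/lib/jvm/java-openjdk/bin/jstack <pid> > /tmp/nm_threads.txt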

 

I've also found this description useful for figuring out what is going on inside the JVM.

 

Kind regards:
Ferenc


Ferenc Erdelyi, Technical Solutions Manager


Contributor

Thanks for providing the link for JVM analysis; it's really helpful.

 

Yes, in my case the process is not responding, so I should have used the `kill -3` option.
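
For the record, a rough sketch of that approach: kill -3 sends SIGQUIT, which makes the JVM itself print a thread dump, so it works even when the attach socket is dead. The dump goes to the JVM's stdout, which the CM agent captures under the per-process logs directory (stdout.log or stderr.log, depending on the redirection):

kill -3 <nodemanager-pid>
less /run/cloudera-scm-agent/process/<id>-yarn-NODEMANAGER/logs/stderr.log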

 

I will try it next time and share the results.

Contributor

Hi @Bender 

 

As described in the link, I tried to produce thread and heap dump files from the running process, but as I mentioned earlier, the process gets killed or throws an error.

 

Here is the output I am getting when I run jmap as per the doc/link provided:

 

[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ ps -fe | grep nodemanager
yarn      5235 12503  0 13:07 ?        00:00:00 /usr/lib/jvm/java-openjdk/bin/java -Dproc_nodemanager -Xmx1000m -Djava.net.preferIPv4Stack=true -server -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dlibrary.leveldbjni.path=/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER -Dhadoop.event.appender=,EventCatcher -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/CD-YARN-QafZaOEK_CD-YARN-QafZaOEK-NODEMANAGER-4756e03a64cd1a4e535550d4cd740b08_pid5235.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop-cmf-CD-YARN-QafZaOEK-NODEMANAGER-us-east-1a-test-east-cdh-tasknode5152.throtle-test.internal.log.out -Dyarn.log.file=hadoop-cmf-CD-YARN-QafZaOEK-NODEMANAGER-us-east-1a-test-east-cdh-tasknode5152.throtle-test.internal.log.out -Dyarn.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/native -classpath /run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/.//*:/usr/share/cmf/lib/plugins/event-publish-5.16.2-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-5.16.2.jar:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager
yarn      5240  5235  0 13:07 ?        00:00:00 python2.7 /usr/lib64/cmf/agent/build/env/bin/cmf-redactor /usr/lib64/cmf/service/yarn/yarn.sh nodemanager
yarn      5487 31141  0 13:07 pts/0    00:00:00 grep --color=auto nodemanager
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ /usr/lib/jvm/java-openjdk/bin/jmap -heap 5235 > /tmp/jmap_5235_heap.out
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.tools.jmap.JMap.runTool(JMap.java:201)
        at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: java.lang.NullPointerException
        at sun.jvm.hotspot.tools.HeapSummary.run(HeapSummary.java:157)
        at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
        at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
        at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
        at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:50)
        ... 6 more
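
As a possible fallback I may try the forced serviceability-agent mode, which (if I understand correctly) can still read a hung process, though it pauses the target and supports -histo/-dump rather than -heap:

/usr/lib/jvm/java-openjdk/bin/jmap -F -histo 5235 > /tmp/jmap_5235_histo.out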

 

Here is the error message I get when I run jstack with the -l option:

 

[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ ps -fe | grep nodemanager
yarn      4518 12503  0 13:04 ?        00:00:00 /usr/lib/jvm/java-openjdk/bin/java -Dproc_nodemanager -Xmx1000m -Djava.net.preferIPv4Stack=true -server -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dlibrary.leveldbjni.path=/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER -Dhadoop.event.appender=,EventCatcher -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/CD-YARN-QafZaOEK_CD-YARN-QafZaOEK-NODEMANAGER-4756e03a64cd1a4e535550d4cd740b08_pid4518.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop-cmf-CD-YARN-QafZaOEK-NODEMANAGER-us-east-1a-test-east-cdh-tasknode5152.throtle-test.internal.log.out -Dyarn.log.file=hadoop-cmf-CD-YARN-QafZaOEK-NODEMANAGER-us-east-1a-test-east-cdh-tasknode5152.throtle-test.internal.log.out -Dyarn.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/native -classpath /run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-mapreduce/.//*:/usr/share/cmf/lib/plugins/event-publish-5.16.2-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-5.16.2.jar:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/.//*:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop-yarn/lib/*:/run/cloudera-scm-agent/process/109-yarn-NODEMANAGER/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager
yarn      4523  4518  0 13:04 ?        00:00:00 python2.7 /usr/lib64/cmf/agent/build/env/bin/cmf-redactor /usr/lib64/cmf/service/yarn/yarn.sh nodemanager
yarn      4717 31141  0 13:05 pts/0    00:00:00 grep --color=auto nodemanager
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ /usr/lib/jvm/java-openjdk/bin/jstack -F 4518 > /tmp/jstack_4518_f.out
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ less /tmp/jstack_4518_f.out
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$
[yarn@us-east-1a-test-east-cdh-tasknode5152 process]$ /usr/lib/jvm/java-openjdk/bin/jstack -l 4518 > /tmp/jstack_4518_f.out
4518: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding

 

For process 4518 I ran jstack with the -F option, and here is the output:

 

Debugger attached successfully.
Server compiler detected.
JVM version is 25.181-b13
Deadlock Detection:

No deadlocks found.