Nodemanager bad health and connection refused

Contributor

@Jay Kumar SenSharma maybe you can help me with this one instead?

I have a 4-node cluster. All four nodes are DataNodes, and one of them is also the ResourceManager. My Ambari installation only installed a NodeManager on the master (ResourceManager) node. Assuming this is correct (please let me know if it is not), I have been getting alerts about that NodeManager. Ambari says its health is bad because it cannot connect:

Connection failed to http://ncienspk01.nciwin.local:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)

Many of my services had corrupt installs and I had to re-install them. That may be the case here as well. Any thoughts on how to re-install?
Also, should I have a NodeManager on every node? If so, how do I install them and connect them?

Thanks for your help! Dan

13 REPLIES

Super Mentor

@Daniel Zafar

The error indicates that the NodeManager did not start successfully or is down, so port 8042 is not accessible.

You can try starting the NodeManager manually from the command line to isolate the issue (that is, to see whether it starts fine without Ambari), because Ambari also performs the NodeManager health validation during startup.

# su -l yarn -c "/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager"

Then verify whether port 8042 is open:

# netstat -tnlpa | grep 8042
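
A minimal sketch of the same check as a reusable function, assuming a POSIX shell and awk. The helper name port_is_listening is made up for illustration; it only parses `netstat -tnl` (or `ss -tnl`) output from stdin and reports whether something is listening on the given port. The URL in the last comment is the one the Ambari alert polls, according to the error message at the top of the thread.

```shell
# Hypothetical helper: succeeds if the listing on stdin contains a LISTEN
# line whose address field ends in ":<port>".
port_is_listening() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ (":" port "$")) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on the NodeManager host:
#   netstat -tnl | port_is_listening 8042 && echo "8042 is listening"
# If the port is open, the endpoint polled by the alert should answer too:
#   curl -sS http://ncienspk01.nciwin.local:8042/ws/v1/node/info
```

If the function fails right after a seemingly clean start, the NodeManager most likely exited during initialization, and the logs are the next place to look.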

Also, once the NodeManager has been started from the command line, please check the NodeManager logs and the free memory available on the host.

Logs:

/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-*.log
/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-*.out
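
If the logs are long, a filter along these lines can surface the lines that matter. It is only a sketch: the function name nm_log_errors is hypothetical, and it assumes the usual log layout where the level (for example ERROR) is surrounded by spaces and Java root causes start with "Caused by:".

```shell
# Hypothetical helper: read a log on stdin and print each distinct ERROR line
# and each distinct "Caused by:" line once.
nm_log_errors() {
    grep -E ' ERROR |^Caused by: ' | sort -u
}

# Usage:
#   cat /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-*.log | nm_log_errors
```

Repeated stack traces collapse to a single line per cause, which makes a missing class or a port conflict easier to spot.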

Memory:

# ps -ef | grep `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid`
# $JAVA_HOME/bin/jmap -heap `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid`
# free -m

A NodeManager can be installed on all cluster nodes, so that the ResourceManager has more nodes available. For a 4-node cluster I would suggest installing it on all 4 nodes (or at least 3). Installing the NodeManager on a single node might make your jobs run very slowly.

Contributor

@Jay Kumar SenSharma

I was able to start the node-manager from the command line with no issue.

[root@NCIENSPK01 ~]#  su -l yarn -c "/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager"
WARNING: Use of this script to start YARN daemons is deprecated.
WARNING: Attempting to execute replacement "yarn --daemon start" instead.
[root@NCIENSPK01 ~]#

port?

[root@NCIENSPK01 ~]# netstat -tnlpa | grep 8042
[root@NCIENSPK01 ~]#

memory?

[root@NCIENSPK01 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          40072        6305       30089          51        3678       33190
Swap:          8063           0        8063

Can you please show me how to re-install nodemanager on this node and how to do a fresh install and any configuration for the other nodes?

Super Mentor

@Daniel Zafar

Your NodeManager start command ran fine; however, the netstat command did not show anything listening on port 8042, which means the NodeManager did not actually start successfully.

# netstat -tnlpa | grep 8042

Can you please check and share the NodeManager (NM) logs?

Also, regarding installing the NodeManager on other nodes: it is quite easy and can be done via the Ambari UI as follows:

Ambari UI --> Hosts (Tab) --> Click on the desired host link --> Click "Add" button (on the Components Panel) and then choose NodeManager from the drop down

(Screenshot: 86436-add-nodemanager.png)
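
The same step can probably also be scripted against the Ambari REST API instead of the UI. The sketch below is an assumption-heavy illustration, not something taken from this thread: the Ambari URL, cluster name, host name and admin user are placeholders, and the endpoint layout is the host_components resource as I understand it from the Ambari API documentation. The function only builds the URL; the curl calls are left as comments so nothing is changed by accident.

```shell
# Placeholders (assumptions): adjust to your own Ambari server and cluster.
AMBARI_URL="${AMBARI_URL:-http://ambari.example.com:8080}"
CLUSTER="${CLUSTER:-mycluster}"

# Hypothetical helper: print the host_components URL for NODEMANAGER on a host.
nm_component_url() {
    printf '%s/api/v1/clusters/%s/hosts/%s/host_components/NODEMANAGER\n' \
        "$AMBARI_URL" "$CLUSTER" "$1"
}

# Register the component on the host, then ask Ambari to install it:
#   curl -u admin -H 'X-Requested-By: ambari' -X POST \
#        "$(nm_component_url host2.example.com)"
#   curl -u admin -H 'X-Requested-By: ambari' -X PUT \
#        -d '{"HostRoles": {"state": "INSTALLED"}}' \
#        "$(nm_component_url host2.example.com)"
```

For a handful of hosts the UI route described above is simpler; a script like this mainly helps when the same component has to be added to many hosts.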

Similarly, to delete a NodeManager from a particular host:

Ambari UI --> Hosts (Tab) --> Click on the desired host link --> On the host page, click on the "NodeManager" dropdown menu. After stopping the NodeManager you will see an option to "Delete" it.

(Screenshot: 86435-delete-nm.png)

Contributor

@Jay Kumar SenSharma I think it's pretty clear that I have an issue with my NodeManager and need to re-install it. Other things as well?

Here are my logs:

2018-08-09 17:18:33,029 INFO  service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed in state INITED
java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
        at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:167)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
2018-08-09 17:18:33,030 INFO  service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
        at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:167)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more



2018-08-09 17:18:33,031 INFO  service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
        at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:167)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more
2018-08-09 17:18:33,032 INFO  impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(210)) - Stopping NodeManager metrics system...
2018-08-09 17:18:33,032 INFO  impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
2018-08-09 17:18:33,034 INFO  impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(216)) - NodeManager metrics system stopped.
2018-08-09 17:18:33,034 INFO  impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)) - NodeManager metrics system shutdown complete.
2018-08-09 17:18:33,034 ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(932)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
        at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:167)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more
2018-08-09 17:18:33,036 INFO  nodemanager.NodeManager (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at NCIENSPK01.nciwin.local/10.96.26.90
************************************************************/

Contributor
@Jay Kumar SenSharma

As you instructed I deleted nodemanager from the main node then added it to all four nodes. Now I have a node manager on each node. Unfortunately none of them work. I still get the above errors on each node. They all have the same lines:

ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(932)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
...

It seems like the install of YARN is corrupted as it is missing this core class. Is that correct? What is the solution? It's also worth mentioning that my YARN Timeline Service has never worked. I have it on maintenance mode so that I would be able to start the cluster. Maybe that is a symptom of the present issue?

Super Mentor

@Daniel Zafar

Do you have a JAR like the following present in your cluster? The version might be slightly different in your case.

/usr/hdp/3.0.0.0-1634/spark2/aux/spark-2.3.1.3.0.0.0-1634-yarn-shuffle.jar

Do you have Spark2 installed on your cluster?

Please check the "yarn.nodemanager.aux-services" property of the YARN service. You will probably find a value like the following, which includes the Spark2 shuffle service:

mapreduce_shuffle,spark2_shuffle,{{timeline_collector}}
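
Why this property matters: the log above shows AuxServices failing to load org.apache.spark.network.yarn.YarnShuffleService, which suggests that an entry such as spark2_shuffle in this list makes the NodeManager load the Spark shuffle class at startup. A small sketch for testing whether a given entry is in the list, assuming a POSIX shell; the function name has_aux_service is hypothetical, and the value is passed in by hand (for example copied from the Ambari config page).

```shell
# Hypothetical helper: succeeds if the comma-separated list in $1
# contains exactly the entry named in $2.
has_aux_service() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage:
#   has_aux_service 'mapreduce_shuffle,spark2_shuffle,{{timeline_collector}}' spark2_shuffle \
#       && echo "spark2_shuffle is enabled, so the Spark2 shuffle jar must be present"
```

Wrapping both the list and the entry in commas avoids partial matches, so a lookup for shuffle does not match spark2_shuffle.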

Rising Star

The Spark YARN shuffle jar is missing from your server, which is causing the NodeManager failure.

Please check these paths:

If you have Spark installed: /usr/hdp/<hdp-version>/spark/aux/

If you have Spark2 installed: /usr/hdp/<hdp-version>/spark2/aux/

The jar name will be similar to spark-<sparkversion>.<hdpversion>-yarn-shuffle.jar

If this file is not present, you can copy the jar from any other host where the NodeManager is working fine.

Just copy the jar into that path and start the NodeManager service.
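
The check and copy described above could be sketched like this, assuming a POSIX shell. The function name find_shuffle_jar and the host name in the scp comment are hypothetical; the directory in the usage comment is the HDP 3.0.0.0-1634 path mentioned earlier in the thread and may differ on your cluster.

```shell
# Hypothetical helper: print every *yarn-shuffle*.jar in the given aux
# directory; fail if none is found.
find_shuffle_jar() {
    shuffle_jar_status=1
    for shuffle_jar in "$1"/*yarn-shuffle*.jar; do
        [ -f "$shuffle_jar" ] || continue   # unmatched glob stays literal
        printf '%s\n' "$shuffle_jar"
        shuffle_jar_status=0
    done
    return "$shuffle_jar_status"
}

# Usage:
#   find_shuffle_jar /usr/hdp/3.0.0.0-1634/spark2/aux || echo "shuffle jar missing"
# Copy it from a host that has it (host name is a placeholder):
#   scp 'good-host.example.com:/usr/hdp/3.0.0.0-1634/spark2/aux/*yarn-shuffle*.jar' \
#       /usr/hdp/3.0.0.0-1634/spark2/aux/
```

Running the check on every node before restarting the NodeManagers shows quickly which hosts still lack the jar.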

Contributor

@Jay Kumar SenSharma

Thanks for troubleshooting with me. I don't have the jar you pointed at:

[root@NCIENSPK01 ~]# ls /usr/hdp/3.0.0.0-1634/spark2
aux   data      jars      NOTICE  README.md  standalone-metastore
bin   doc       LICENSE   python  RELEASE    work
conf  examples  licenses  R       sbin       yarn
[root@NCIENSPK01 ~]# ls /usr/hdp/3.0.0.0-1634/spark2/aux
[root@NCIENSPK01 ~]#

Here is that config:
for yarn.nodemanager.aux-services I have the following present:

mapreduce_shuffle,spark2_shuffle,{{timeline_collector}}

What is the next step? Should I re-install spark2?

@Pankaj Kadam I do not have any nodemanagers working in my cluster. I believe there was a corrupt installation. I have not yet run a successful job on this cluster.

Note Timeline Service Reader V2.0 is also failing with error:

resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarn-ats -l -s /bin/bash -c 'export  PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/lib64/qt-3.3/bin:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/maven/bin:/root/bin:/opt/maven/bin:/opt/maven/bin:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.0.0.0-1634/hadoop-yarn/timelineservice/*; /usr/hdp/3.0.0.0-1634/hbase/bin/hbase --config /usr/hdp/3.0.0.0-1634/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds

Contributor
@Jay Kumar SenSharma

A few updates....

I used the commands:

yum remove spark2_3_0_0_0_1634-yarn-shuffle
yum install spark2_3_0_0_0_1634-yarn-shuffle

to re-install spark2 yarn shuffle and like magic I found the jar:

[root@NCIENSPK01 ~]# ls /usr/hdp/3.0.0.0-1634/spark2/aux/
spark-2.3.1.3.0.0.0-1634-yarn-shuffle.jar

BUT UNFORTUNATELY this deleted a lot of my core packages. So I had to re-install lots of core files from repo:

yum install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl

Now I'm getting this error when I try to start resourcemanager and nodemanager

resource_management.core.exceptions.ExecutionFailed: Execution of 'ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/3.0.0.0-1634/hadoop/libexec && /usr/hdp/3.0.0.0-1634/hadoop-yarn/bin/yarn --config /usr/hdp/3.0.0.0-1634/hadoop/conf --daemon start nodemanager' returned 1. ERROR: Hadoop common not found.


Please help 😞
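
A note for anyone repeating this step: `yum remove` can take dependent packages with it, which appears to be what happened here. A possibly gentler sketch is to verify the package and then reinstall it in place. The helper names run and reinstall_pkg are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
# By default commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: verify a package, then reinstall it without removing it.
reinstall_pkg() {
    run rpm -V "$1"            # lists files that differ from the package, or are missing
    run yum -y reinstall "$1"  # reinstalls the same version in place
}

# Usage:
#   reinstall_pkg spark2_3_0_0_0_1634-yarn-shuffle
```

Because nothing is removed first, packages that depend on the reinstalled one should stay untouched.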

Contributor

@Jay Kumar SenSharma
When I try to start services now I'm getting:

For HDFS Client Install

RuntimeError: Failed to execute command '/usr/bin/yum -y install hadoop_3_0_0_0_1634', exited with code '1', message: 'Error unpacking rpm package hadoop_3_0_0_0_1634-3.1.0.3.0.0.0-1634.x86_64'

For Hive Client Install

RuntimeError: Failed to execute command '/usr/bin/yum -y install hive_3_0_0_0_1634-hcatalog', exited with code '1', message: 'Error unpacking rpm package hadoop_3_0_0_0_1634-3.1.0.3.0.0.0-1634.x86_64

Contributor

@Jay Kumar SenSharma I'm definitely in a jam now. Really hoping you can help me. A bit scared to touch anything at this point.

Contributor

So I resolved all this. I just followed the steps here to remove all my packages, then deleted the HDP install directory:

rm -rf /usr/hdp/
Then in Ambari I used the "Start all Services" command and it went through and installed everything again for me.

Then, to solve the NodeManager issue, I did the spark-yarn- install, which gave me the missing jar that I needed, and then just copied that dir:
/usr/hdp/3.0.0.0-.../spark2/aux/
to all the other nodes in my cluster. Now all my NodeManagers are coming up and things are looking good.
I'm creating another post about resolving my Timeline Service V2.0 issue, which is somehow still persisting.

Rising Star

I'm glad that it's all sorted now. Another way would have been to delete the particular node from the cluster, re-add it, and then add the Spark client on it. I recently did that on one of my test clusters and it worked.
