Created 08-09-2018 10:07 PM
@Jay Kumar SenSharma maybe you can help me with this one instead?
I have a 4-node cluster. All four are datanodes and one node is also the resource-manager. My ambari installation only installed a node-manager on my master resource-manager node. Assuming this is correct (please let me know if it is not), I have been getting errors about my node-manager. It says the health is bad because it cannot connect:
Connection failed to http://ncienspk01.nciwin.local:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)
Many of my services had corrupt installs and I did a re-install. That may be the case here as well. Thoughts on how to re-install?
Also, should I have a node-manager on every node? If so, how do I install and connect them?
Thanks for your help! Dan
Created 08-10-2018 07:29 PM
@Jay Kumar SenSharma
When I try to start services now I'm getting:
For HDFS Client Install
RuntimeError: Failed to execute command '/usr/bin/yum -y install hadoop_3_0_0_0_1634', exited with code '1', message: 'Error unpacking rpm package hadoop_3_0_0_0_1634-3.1.0.3.0.0.0-1634.x86_64'
For Hive Client Install
RuntimeError: Failed to execute command '/usr/bin/yum -y install hive_3_0_0_0_1634-hcatalog', exited with code '1', message: 'Error unpacking rpm package hadoop_3_0_0_0_1634-3.1.0.3.0.0.0-1634.x86_64'
Created 08-11-2018 12:05 AM
So I resolved all this. I just followed the steps here to remove all my packages, then deleted the contents of my files:
rm -rf /usr/hdp/
Then in Ambari I used the "Start all Services" command and it went through and installed everything again for me.
I also copied the jar from /usr/hdp/3.0.0.0-.../spark2/aux/ to all the other nodes in my cluster. Now all my NodeManagers are coming up and things are looking good.
Created 08-09-2018 10:15 PM
The error indicates that the NodeManager did not start successfully or is down, so port 8042 is not accessible.
Maybe you can try starting the NodeManager manually from the command line to isolate the issue (to see whether it starts fine without Ambari), because Ambari also performs the NodeManager health validation during startup.
# su -l yarn -c "/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager"
Then verify whether port 8042 is open:
# netstat -tnlpa | grep 8042
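The port check above can also be scripted. Here is a minimal sketch, assuming the usual net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name `port_is_listening` is made up for illustration and is not part of any Hadoop or Ambari tooling.

```shell
# port_is_listening PORT: reads `netstat -tnl` style output on stdin and
# succeeds (exit status 0) if some socket is in LISTEN state on PORT.
# Assumption: column 4 is the local address and column 6 is the state.
port_is_listening() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      # The local address looks like 0.0.0.0:8042 or :::8042,
      # so the port is whatever follows the last colon.
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Usage on a live host (not executed here):
#   netstat -tnl | port_is_listening 8042 && echo "NodeManager web port is up"
```

Matching on the port column rather than grepping for "8042" avoids false hits on PIDs or remote ports that happen to contain the same digits.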
Also, once the NodeManager has been started from the command line, please check the NodeManager logs and the free memory available on the host.
Logs:
/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-*.log
/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-*.out
Memory:
# ps -ef | grep `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid`
# $JAVA_HOME/bin/jmap -heap `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid`
# free -m
NodeManager can be installed on all cluster nodes so that the ResourceManager has more nodes available to schedule on. For a 4-node cluster I would suggest installing it on all 4 nodes (or at least 3). Installing NodeManager on a single node might make your jobs run very slowly.
Created 08-09-2018 10:26 PM
I was able to start the node-manager from the command line with no issue.
[root@NCIENSPK01 ~]# su -l yarn -c "/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager"
WARNING: Use of this script to start YARN daemons is deprecated.
WARNING: Attempting to execute replacement "yarn --daemon start" instead.
[root@NCIENSPK01 ~]#
port?
[root@NCIENSPK01 ~]# netstat -tnlpa | grep 8042
[root@NCIENSPK01 ~]#
memory?
[root@NCIENSPK01 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          40072        6305       30089          51        3678       33190
Swap:          8063           0        8063
Can you please show me how to re-install nodemanager on this node and how to do a fresh install and any configuration for the other nodes?
Created on 08-09-2018 10:34 PM - edited 08-17-2019 07:47 PM
Your NodeManager start command ran without error, but the netstat command did not show anything listening on port 8042, which means the NodeManager did not actually start successfully.
# netstat -tnlpa | grep 8042
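The check that the Ambari alert performs can also be reproduced by hand against the same REST endpoint. This is a rough sketch: the URL path and default port come from the alert message earlier in this thread, while the helper names `nm_info_url` and `nm_is_up` and the 5-second timeout are assumptions for illustration.

```shell
# nm_info_url HOST [PORT]: prints the NodeManager REST URL that the Ambari
# alert in this thread polls. PORT defaults to 8042.
nm_info_url() {
  printf 'http://%s:%s/ws/v1/node/info\n' "$1" "${2:-8042}"
}

# nm_is_up HOST [PORT]: succeeds if the endpoint answers with HTTP 200.
# --max-time bounds the wait; a refused connection makes curl itself fail,
# which is reported here as "not up".
nm_is_up() {
  code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$(nm_info_url "$@")") || return 1
  [ "$code" = "200" ]
}

# Usage on a live host (not executed here):
#   nm_is_up ncienspk01.nciwin.local && echo "NodeManager is answering"
```

If `netstat` shows nothing on 8042, this probe should fail in the same way as the alert ("Connection refused"), which confirms the problem is the daemon and not Ambari.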
Can you please check and share the NM logs?
Also, regarding installing NodeManager on other nodes: it is quite easy and can be done via the Ambari UI as follows:
Ambari UI --> Hosts (Tab) --> Click on the desired host link --> Click "Add" button (on the Components Panel) and then choose NodeManager from the drop down
Similarly if you want to delete a NodeManager from a particular host then do the same:
Ambari UI --> Hosts (Tab) --> Click on the desired host link --> On the host page Click on the "NodeManager" dropdown menu. After Stopping NodeManager you will see option to "Delete" the NodeManager.
Created 08-09-2018 11:40 PM
@Jay Kumar SenSharma I think it's pretty clear that I have an issue with my NodeManager and need to re-install it. Other things as well?
Here are my logs:
2018-08-09 17:18:33,029 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed in state INITED
java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
    at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:167)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
2018-08-09 17:18:33,030 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
    at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:167)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    ... 8 more
2018-08-09 17:18:33,031 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
    at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:167)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    ... 8 more
2018-08-09 17:18:33,032 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(210)) - Stopping NodeManager metrics system...
2018-08-09 17:18:33,032 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
2018-08-09 17:18:33,034 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(216)) - NodeManager metrics system stopped.
2018-08-09 17:18:33,034 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)) - NodeManager metrics system shutdown complete.
2018-08-09 17:18:33,034 ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(932)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
    at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:167)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    ... 8 more
2018-08-09 17:18:33,036 INFO nodemanager.NodeManager (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at NCIENSPK01.nciwin.local/10.96.26.90
************************************************************/
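A long trace like this usually boils down to one root cause repeated several times. A minimal sketch for condensing it, assuming the log uses the standard Java "Caused by:" lines; the function name `root_causes` is hypothetical.

```shell
# root_causes: reads a log on stdin and prints each distinct
# "Caused by: <exception class>: <detail>" fragment with an occurrence
# count, most frequent first.
root_causes() {
  grep -o 'Caused by: [^ ]*: [^ ]*' | sort | uniq -c | sort -rn
}

# Usage on a live host (not executed here):
#   cat /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-*.log | root_causes
```

For the log in this post, every "Caused by" line names the same missing class, `org.apache.spark.network.yarn.YarnShuffleService`, so the summary would collapse to a single entry.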
Created 08-10-2018 05:09 AM
As you instructed I deleted nodemanager from the main node then added it to all four nodes. Now I have a node manager on each node. Unfortunately none of them work. I still get the above errors on each node. They all have the same lines:
ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(932)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:473)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
...
It seems like the install of YARN is corrupted as it is missing this core class. Is that correct? What is the solution? It's also worth mentioning that my YARN Timeline Service has never worked. I have it on maintenance mode so that I would be able to start the cluster. Maybe that is a symptom of the present issue?
Created 08-10-2018 05:13 AM
Do you have a JAR like the following present in your cluster? The version might be slightly different in your case.
/usr/hdp/3.0.0.0-1634/spark2/aux/spark-2.3.1.3.0.0.0-1634-yarn-shuffle.jar
Do you have Spark2 installed on your cluster?
Please check the "yarn.nodemanager.aux-services" property of the YARN service. You will find a value like the following, which may include the spark2 shuffle:
mapreduce_shuffle,spark2_shuffle,{{timeline_collector}}
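To make that check concrete, here is a small sketch that tests whether a given service name appears in the comma-separated property value. The helper name `has_aux_service` is made up for illustration; the sample value is the one shown above.

```shell
# has_aux_service VALUE NAME: succeeds if NAME is one of the comma-separated
# entries in VALUE (the yarn.nodemanager.aux-services property).
# Wrapping both sides in commas forces a whole-entry match, so
# "spark_shuffle" does not match "spark2_shuffle".
has_aux_service() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

aux='mapreduce_shuffle,spark2_shuffle,{{timeline_collector}}'
if has_aux_service "$aux" spark2_shuffle; then
  echo "spark2_shuffle is enabled: each NodeManager needs the Spark shuffle jar"
fi
```

When `spark2_shuffle` is listed, the NodeManager tries to load `YarnShuffleService` at startup, which is why a missing shuffle jar stops the daemon from starting at all.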
Created 08-10-2018 05:49 AM
The Spark YARN shuffle jar is missing from your server, which is causing the NodeManager failure.
Please check these paths:
If you have Spark installed: /usr/hdp/<hdp-version>/spark/aux/
If you have Spark2 installed: /usr/hdp/<hdp-version>/spark2/aux/
The file name looks like spark-<sparkversion>.<hdpversion>-yarn-shuffle.jar
If this file is not present, you can copy the jar from any other host where the NodeManager is working fine.
Just copy the jar into that path and start the NodeManager service.
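The check-and-copy step can be sketched as below. The function name `find_shuffle_jar` and the host name in the usage comment are placeholders, and `<hdp-version>` still has to be replaced with the version directory on your hosts.

```shell
# find_shuffle_jar AUX_DIR: prints the first *-yarn-shuffle.jar found in
# AUX_DIR and succeeds, or fails if the directory holds no such jar.
find_shuffle_jar() {
  for jar in "$1"/*-yarn-shuffle.jar; do
    # When nothing matches, the glob stays literal and the -f test fails.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

# Hypothetical usage (not executed here):
#   aux_dir=/usr/hdp/<hdp-version>/spark2/aux
#   find_shuffle_jar "$aux_dir" || echo "shuffle jar missing in $aux_dir"
#   scp "$(find_shuffle_jar "$aux_dir")" other-host:"$aux_dir"/
```

Run the check on every NodeManager host, since each one loads the jar from its own local filesystem.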
Created 08-10-2018 03:13 PM
Thanks for troubleshooting with me. I don't have the jar you pointed at:
[root@NCIENSPK01 ~]# ls /usr/hdp/3.0.0.0-1634/spark2
aux   data      jars      NOTICE  README.md  standalone-metastore
bin   doc       LICENSE   python  RELEASE    work
conf  examples  licenses  R       sbin       yarn
[root@NCIENSPK01 ~]# ls /usr/hdp/3.0.0.0-1634/spark2/aux
[root@NCIENSPK01 ~]#
Here is that config:
for yarn.nodemanager.aux-services I have the following present:
mapreduce_shuffle,spark2_shuffle,{{timeline_collector}}
What is the next step? Should I re-install spark2?
@Pankaj Kadam I do not have any nodemanagers working in my cluster. I believe there was a corrupt installation. I have not yet run a successful job on this cluster.
Note Timeline Service Reader V2.0 is also failing with error:
resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarn-ats -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/lib64/qt-3.3/bin:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/maven/bin:/root/bin:/opt/maven/bin:/opt/maven/bin:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.0.0.0-1634/hadoop-yarn/timelineservice/*; /usr/hdp/3.0.0.0-1634/hbase/bin/hbase --config /usr/hdp/3.0.0.0-1634/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds
Created 08-10-2018 05:54 PM
A few updates....
I used the commands:
yum remove spark2_3_0_0_0_1634-yarn-shuffle
yum install spark2_3_0_0_0_1634-yarn-shuffle
to re-install spark2 yarn shuffle and like magic I found the jar:
[root@NCIENSPK01 ~]# ls /usr/hdp/3.0.0.0-1634/spark2/aux/
spark-2.3.1.3.0.0.0-1634-yarn-shuffle.jar
BUT UNFORTUNATELY this deleted a lot of my core packages. So I had to re-install lots of core files from repo:
yum install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl
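For anyone repeating this step, a gentler alternative to the remove/install pair may be worth trying first. This is a sketch, not a tested recipe for HDP: `yum reinstall` rewrites a package's files without erasing the package, so packages that depend on it are not removed along with it, and `rpm -q --whatrequires` lists those dependents beforehand. The wrapper name `safe_reinstall` and the `RUN` preview variable are made up for illustration.

```shell
# RUN=echo (the default here) only prints the commands; set RUN= (empty)
# to actually execute them as root.
RUN=${RUN-echo}

# safe_reinstall PACKAGE: show what depends on PACKAGE, then reinstall it
# in place instead of removing and installing it.
safe_reinstall() {
  # List installed packages that require PACKAGE.
  $RUN rpm -q --whatrequires "$1"
  # Reinstall without erasing, so dependents stay installed.
  $RUN yum -y reinstall "$1"
}

safe_reinstall spark2_3_0_0_0_1634-yarn-shuffle
```

With the default preview mode this prints the two commands so they can be reviewed before anything on the host changes.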
Now I'm getting this error when I try to start resourcemanager and nodemanager
resource_management.core.exceptions.ExecutionFailed: Execution of 'ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/3.0.0.0-1634/hadoop/libexec && /usr/hdp/3.0.0.0-1634/hadoop-yarn/bin/yarn --config /usr/hdp/3.0.0.0-1634/hadoop/conf --daemon start nodemanager' returned 1. ERROR: Hadoop common not found.
Please help 😞