Ambari Show NodeManager Down, However ...

Hello,

I have set up a 6-node cluster (2 master, 3 data and 1 edge node). The setup went smoothly without any issues. However, Ambari shows the NodeManagers going down. In the ResourceManager UI, I can see that one NodeManager was active and the other two were down.

I restarted the YARN services after running the command below on all nodes, from inside the /var/log/hadoop-yarn/nodemanager/recovery-state directory (I found this solution on a forum):

rm -rf *

After restarting YARN, I can see all NodeManagers up in the ResourceManager UI, but Ambari is still showing Down alerts. The generated alerts are below:

Connection failed to http://ip-172-31-32-138.us-west-2.compute.internal:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)



Can anyone tell me how to get rid of this alert/error in Ambari?

Thanks and Regards,
Laeeq -

8 REPLIES

I just found that all NodeManagers are down again. Can anyone please provide a fix?

Hello @Laeeq Ahmad!
Could you check the output of the following command?

netstat -tunlp | grep 8042

A couple of things may help us find the issue:
- Check whether you have a firewall enabled
- Check your FQDN and whether it matches ip-172-31-32-138.us-west-2.compute.internal
- Take a look at the NodeManager log, /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-<hostname>.log, and look for any ERROR/WARN entries
Hope this helps!
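
The port check can also be run from a different host, which is closer to what the Ambari alert itself does. Here is a minimal Python 3 sketch, assuming the NodeManager web port is 8042 as in the alert; the function name `port_is_open` is made up for illustration and is not part of Ambari:

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused" (errno 111), timeouts and DNS failures.
        return False
```

Calling `port_is_open("ip-172-31-32-138.us-west-2.compute.internal", 8042)` from the Ambari server host should return False for as long as the alert reports "Connection refused".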

Thanks Vinicius,

I found that the NodeManager was not running on the DataNodes, so I started it manually using the command below:

/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager

However, it keeps going down again and again.

There are container-related errors/warnings in the log, but I am still not able to find the root cause.

Regards,

Laeeq -

Good to know you made some progress there 🙂
Okay, is there any error code? Usually YARN logs an exit code or a few lines of class/method names. Could you share the output from the logs?

Thanks.

Yes, e.g.:

2018-07-15 17:52:52,377 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(237)) - Exit code from container container_e01_1531677114201_0002_01_000001 is : 143

2018-07-15 18:22:17,588 WARN launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(113)) - Recovered container exited with a non-zero exit code 154

2018-07-15 18:22:19,193 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(230)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users.

2018-07-15 17:51:39,597 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM

2018-07-15 18:22:17,587 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(98)) - Unable to recover container container_e01_1531677114201_0001_01_000001
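
To pull only the WARN and ERROR entries out of a large NodeManager log, a small filter can help. This is a sketch in Python 3 that assumes the log4j-style layout of the lines above (timestamp, level, then logger); `iter_problems` is a hypothetical helper name:

```python
import re

# Timestamp, then level, then the rest of the line.
# \s* after the level tolerates a missing space between level and logger.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>WARN|ERROR|FATAL)\s*(?P<rest>.*)$"
)


def iter_problems(lines):
    """Yield (timestamp, level, message) for WARN/ERROR/FATAL log lines."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            yield match.group("ts"), match.group("level"), match.group("rest")
```

Passing an open log file to `iter_problems` works because a file object iterates line by line; sorting the results by timestamp makes it easier to see what happened just before each SIGTERM.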

Also, I am still getting the same alert in Ambari:

Connection failed to http://temp.tem1.org:8042/ws/v1/node/info (Traceback (most recent call last):

  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)

Hi @Laeeq Ahmad.
Okay, so this time you're having issues with another FQDN, right?
The previous alert complained about ip-172-31-32-138.us-west-2.compute.internal and now it's temp.tem1.org.
So let's check whether the NodeManager hosts (as set in Ambari) match the output of the following commands:

cat /etc/sysconfig/network 
cat /etc/hosts
hostname --fqdn
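
The same comparison can be scripted on each host. This is a rough Python 3 sketch; `fqdn_matches` is a hypothetical helper, and `socket.getfqdn()` is only assumed to agree with `hostname --fqdn` on a correctly configured machine:

```python
import socket


def fqdn_matches(expected, actual=None):
    """Compare the host name registered in Ambari with this machine's FQDN."""
    if actual is None:
        actual = socket.getfqdn()  # usually what `hostname --fqdn` reports

    def normalize(name):
        # Host names are case-insensitive; ignore a trailing dot as well.
        return name.strip().lower().rstrip(".")

    return normalize(expected) == normalize(actual)
```

For example, `fqdn_matches("temp.tem1.org")` run on the NodeManager host should return True if Ambari and the operating system agree on the name.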

Now, regarding the WARN/ERROR messages:

143 => As far as I know, this exit code is usually related to memory misconfiguration. Take a look at this link: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_command-line-installation/content/determ... In Ambari it's also possible to apply the recommended values for most of these parameters 🙂

Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. => Try adding the sticky bit to the directory configured in yarn.nodemanager.remote-app-log-dir, so that its permissions become rwxrwxrwt (mode 1777) as the log expects.

154 => Perhaps this link explains what's going on here: https://hortonworks.com/blog/resilience-of-yarn-applications-across-nodemanager-restarts/
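
As background for reading these codes: by shell convention, an exit code above 128 means the process was ended by a signal, and the signal number is the code minus 128. So 143 corresponds to signal 15 (SIGTERM), which fits the "RECEIVED SIGNAL 15: SIGTERM" line in the log. Here is a small Python 3 sketch; `describe_exit_code` is a hypothetical helper, and treating 154 as a YARN-specific "container lost" code rather than a signal is an assumption that should be checked against your Hadoop version:

```python
import signal

# Assumption: YARN uses 154 for a container it could not recover ("lost"),
# so it is listed separately instead of being decoded as a signal.
YARN_SPECIAL_CODES = {154: "container lost (could not be recovered)"}


def describe_exit_code(code):
    """Give a human-readable description of a container exit code."""
    if code in YARN_SPECIAL_CODES:
        return YARN_SPECIAL_CODES[code]
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"
```

A SIGTERM on a container is often the NodeManager itself stopping the container, for example when it exceeds its memory limit, which is why the memory settings are worth reviewing first.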

PS: whenever the NodeManager crashes, check whether a stale PID file was left behind in /var/run/hadoop-yarn/yarn/.
Hope this helps!
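
The PID check from the PS can be sketched in Python 3 as follows. The helper name `pid_file_is_stale` is hypothetical, and the PID file name in the usage note is an assumption about how the daemon script names it:

```python
import os


def pid_file_is_stale(pid_file):
    """Return True if pid_file names a process that is no longer running."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # no readable PID file, so nothing is stuck

    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence
    except ProcessLookupError:
        return True  # the file points at a process that is gone
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False
```

On the NodeManager host this could be called as `pid_file_is_stale("/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid")`. If it returns True, removing the file before restarting the NodeManager is a reasonable next step.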

Hi @Vinicius Higa Murakami,

No, actually I have created a new cluster; that's why you are seeing two different hostnames.

Below are the hosts file entries:

127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.162.96.45 temp.tem1.org

[root@jazz1 ~]# cat /etc/sysconfig/network
# Created by cloud-init on instance boot automatically, do not edit.
#
NETWORKING=yes
hostname=temp.tem1.org

[root@jazz1 ~]# hostname --fqdn
temp.tem1.org

That is all the requested information. However, I am still getting the errors below, and the NodeManager keeps going down:

NodeManager Web UI
Connection failed to http://temp.tem1.org:8042 (<urlopen error [Errno 111] Connection refused>)

NodeManager Health
Connection failed to http://temp.tem1.org:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)