Ambari Show NodeManager Down, However ...

Explorer

Hello,

I have set up a 6-node cluster (2 M, 3 D, and 1 E nodes). The setup went smoothly, without any issues. However, Ambari shows the NodeManagers going down. In the ResourceManager UI, I could see that 1 node was active and the other 2 NodeManagers were down.

Before restarting the YARN services, I ran the following command on all nodes from the /var/log/hadoop-yarn/nodemanager/recovery-state directory (I found this solution on another forum):

rm -rf *

After restarting YARN, I can see all NodeManagers up in the ResourceManager UI, but Ambari is still showing "down" alerts. Below is the alert that was generated:

Connection failed to http://ip-172-31-32-138.us-west-2.compute.internal:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)



Can anyone let me know how to get rid of this alert/error in Ambari?

Thanks and Regards,
Laeeq -

8 REPLIES

Explorer

I just found that all the NodeManagers are down again. Can anyone please suggest a fix?

Hello @Laeeq Ahmad !
Could you check the output from the following command?

netstat -tunlp | grep 8042

There are a few things that may help us find the issue:
- Check whether a firewall is enabled.
- Check your FQDN and whether it matches ip-172-31-32-138.us-west-2.compute.internal.
- Look at the NodeManager log, /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-<hostname>.log, and try to find any ERROR/WARN entries.
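
The port check above can also be done from any machine that should be able to reach the NodeManager, which helps tell "nothing is listening" apart from "only the Ambari host is blocked". This is a minimal Python 3 sketch, not what Ambari itself runs; port_is_open is a hypothetical helper name, and the host and port would come from the alert URL.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused" (Errno 111), timeouts, and DNS failures.
        return False
```

For example, port_is_open("ip-172-31-32-138.us-west-2.compute.internal", 8042) returning False on the NodeManager host itself would suggest the process is not listening at all, rather than a firewall problem.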
Hope this helps!

Explorer

Thanks Vinicius,

I found that the NodeManager was not running on the DataNodes, so I started it manually with the command below:

/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager

However, it still keeps going down again and again.

The log contains container-related errors and warnings, but I have not been able to find the root cause.

Regards,

Laeeq -

Good to know you made some progress there 🙂
Okay, is there any error code? Usually YARN logs an exit code or a few lines of class/method names. Could you share the output from the logs?

Thanks.

Explorer

Yes, for example:

2018-07-15 17:52:52,377 WARN  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(237)) - Exit code from container container_e01_1531677114201_0002_01_000001 is : 143

2018-07-15 18:22:17,588 WARN  launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(113)) - Recovered container exited with a non-zero exit code 154

2018-07-15 18:22:19,193 WARN  logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(230)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users.

2018-07-15 17:51:39,597 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM

2018-07-15 18:22:17,587 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(98)) - Unable to recover container container_e01_1531677114201_0001_01_000001

Explorer

Also, I am still getting the same alert in Ambari:

Connection failed to http://temp.tem1.org:8042/ws/v1/node/info (Traceback (most recent call last):

  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)

Hi @Laeeq Ahmad.
Okay, so this time you're having issues with another FQDN, right?
The earlier alert complained about ip-172-31-32-138.us-west-2.compute.internal and now it's temp.tem1.org.
So let's check whether the NodeManager hosts (as set in Ambari) match the output of the following commands:

cat /etc/sysconfig/network 
cat /etc/hosts
hostname --fqdn
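
When comparing those outputs by eye, one common thing to look for is whether the FQDN has an entry in the hosts file at all and whether it maps to a loopback address. A minimal Python 3 sketch of that comparison follows; parse_hosts and fqdn_problems are hypothetical helper names, and it only inspects hosts-file text, not DNS.

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname or alias in /etc/hosts-style text to its list of IPs."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), []).append(ip)
    return table


def fqdn_problems(fqdn, hosts_text):
    """Return a list of readable problems; an empty list means nothing suspicious."""
    ips = parse_hosts(hosts_text).get(fqdn.lower(), [])
    if not ips:
        return [f"{fqdn} has no entry in the hosts text"]
    return [f"{fqdn} maps to loopback address {ip}"
            for ip in ips if ipaddress.ip_address(ip).is_loopback]
```

On a node, this could be called as fqdn_problems(socket.getfqdn(), open("/etc/hosts").read()) after importing socket.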

Now, regarding the WARN/ERROR messages:

143 => As far as I know, this exit code is usually related to memory misconfiguration. Take a look at this link: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_command-line-installation/content/determ... Through Ambari it's also possible to apply the recommended values for most of these parameters 🙂
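
As a side note on reading that number: by the common Unix convention, an exit status above 128 means the process was killed by signal (status - 128), so 143 corresponds to signal 15, SIGTERM, the same signal that appears in the "RECEIVED SIGNAL 15: SIGTERM" log line. That says how the container ended, not who sent the signal or why. A minimal Python 3 sketch of the decoding, with signal_from_exit_code as a hypothetical helper name:

```python
import signal


def signal_from_exit_code(code):
    """Return the signal name implied by a 128 + N exit status, or None."""
    if code <= 128:
        return None  # ordinary exit status, no signal implied
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform
```

The convention only applies when a process really died from a signal. Some exit codes are assigned by the software itself, so a value such as the 154 reported by the recovery path should not be decoded this way without checking.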

Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. => Try adding the sticky bit to the directory configured in yarn.nodemanager.remote-app-log-dir.
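
The two permission strings in that warning differ only in the sticky bit: rwxrwxrwt is octal 1777 and rwxrwxrwx is octal 0777. A minimal Python 3 sketch that makes the mapping explicit; perm_string and has_sticky_bit are hypothetical helper names, and the code only converts mode numbers, it does not touch HDFS.

```python
import stat


def perm_string(mode):
    """Render permission bits the way the log message does, e.g. 'rwxrwxrwt'."""
    # stat.filemode() prefixes a file-type character; drop it to match the log.
    return stat.filemode(stat.S_IFDIR | mode)[1:]


def has_sticky_bit(mode):
    """Return True if the sticky bit is set in the given mode."""
    return bool(mode & stat.S_ISVTX)
```

Assuming the directory is /app-logs as in the log message, the change itself would typically be made with hdfs dfs -chmod 1777 /app-logs, run as a user allowed to modify that directory.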

154 => Perhaps this link explains what's going on here: https://hortonworks.com/blog/resilience-of-yarn-applications-across-nodemanager-restarts/

PS: whenever the NodeManager crashes, check whether a stale PID file was left behind in /var/run/hadoop-yarn/yarn/.
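
A minimal Python 3 sketch of that PID check, for POSIX systems only. pid_file_status is a hypothetical helper name, and the PID file path is passed in by the caller because the exact file name under /var/run/hadoop-yarn/yarn/ depends on the installation.

```python
import os


def pid_file_status(path):
    """Classify a PID file as 'missing', 'invalid', 'running', or 'stale'."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks the PID
    except ProcessLookupError:
        return "stale"  # the file names a PID that no longer exists
    except PermissionError:
        return "running"  # the process exists but belongs to another user
    return "running"
```

A result of "running" only means some process has that PID. Since PIDs can be reused, it is worth confirming with ps that the process is actually the NodeManager.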
Hope this helps!

Explorer

Hi @Vinicius Higa Murakami,

No, actually I created a new cluster; that's why you are seeing two different hostnames.

Below are the hosts file entries:

127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.162.96.45 temp.tem1.org

[root@jazz1 ~]# cat /etc/sysconfig/network
# Created by cloud-init on instance boot automatically, do not edit.
#
NETWORKING=yes
hostname=temp.tem1.org

[root@jazz1 ~]# hostname --fqdn
temp.tem1.org

That is all the requested information. However, I am still getting the errors below, and the NodeManager keeps going down.

NodeManager Web UI
Connection failed to http://temp.tem1.org:8042 (<urlopen error [Errno 111] Connection refused>)

NodeManager Health
Connection failed to http://temp.tem1.org:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)