Created 07-14-2018 11:20 AM
Hello,
I have set up a 6-node cluster (2 M, 3 D and 1 E nodes). The setup went smoothly, without any issues. However, I can see the NodeManagers going down in Ambari. In the ResourceManager UI, 1 NodeManager was active and the other 2 were down.
I restarted the YARN services after running the command below on all nodes, from the /var/log/hadoop-yarn/nodemanager/recovery-state directory (I found this solution on a forum):
rm -rf *
After restarting YARN, I can see all NodeManagers up in the ResourceManager, but Ambari is still showing down alerts. Below are the alerts generated:
Connection failed to http://ip-172-31-32-138.us-west-2.compute.internal:8042/ws/v1/node/info (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
url_response = urllib2.urlopen(query, timeout=connection_timeout)
File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)
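For anyone who wants to reproduce this alert by hand: the traceback shows the Ambari alert script opening the NodeManager's /ws/v1/node/info URL on port 8042. Below is a minimal Python 3 sketch of the same kind of check. The function name check_nodemanager and its defaults are mine, not Ambari's, and this is not the actual alert code.

```python
import json
import urllib.error
import urllib.request


def check_nodemanager(host, port=8042, timeout=5.0):
    """Query the NodeManager REST endpoint and return (ok, detail).

    Hypothetical helper: it mimics the URL seen in the alert text,
    it is not the Ambari alert implementation.
    """
    url = "http://{}:{}/ws/v1/node/info".format(host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except urllib.error.URLError as err:
        # "[Errno 111] Connection refused" ends up here: the TCP connection
        # was rejected, so nothing is listening on that port.
        return False, "connection failed: {}".format(err.reason)
    except (OSError, ValueError) as err:
        # Read timeouts, resets, or a body that is not valid JSON.
        return False, "bad response: {}".format(err)
    return True, payload
```

Calling check_nodemanager("ip-172-31-32-138.us-west-2.compute.internal") from the Ambari server host should fail in the same way as the alert for as long as the NodeManager process is down, which suggests the alert is a symptom of the process dying rather than an Ambari problem.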
Can anyone let me know how to get rid of this alert/error in Ambari?
Thanks and Regards,
Laeeq -
Created 07-14-2018 11:23 AM
I just found that all NodeManagers are down again. Can anyone please provide a fix?
Created 07-15-2018 05:33 AM
Hello @Laeeq Ahmad !
Could you check the output from the following command?
netstat -tunlp | grep 8042
There are a couple of things that may help us find the issue:
- Take a look at the NodeManager logs in /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-<hostname>.log and try to find any ERROR/WARN entries.
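If you want to run the port check from another machine (for example the Ambari server) instead of on the node itself, something like this rough Python sketch can stand in for netstat. The helper name port_is_open is just an illustration, and it only tells you whether a TCP connection is accepted, not which process owns the port.

```python
import socket


def port_is_open(host, port=8042, timeout=3.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

A False result for the NodeManager host on 8042 matches the "Connection refused" in the alert and means the process is not listening there.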
Hope this helps!
Created 07-15-2018 06:29 AM
Thanks Vinicius,
I found that the NodeManager was not running on the DataNodes, so I started it manually using the command below:
/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager
However, it keeps going down again and again.
There are container-related errors/warnings in the log, but I am still not able to find the root cause.
Regards,
Laeeq -
Created 07-15-2018 06:34 AM
Good to know, you made some progress there 🙂
Okay, is there any error code? Usually, YARN logs an exit code or a few lines of classes/methods. Could you share the output from the logs?
Thanks.
Created 07-15-2018 06:27 PM
Yes, e.g.:
2018-07-15 17:52:52,377 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(237)) - Exit code from container container_e01_1531677114201_0002_01_000001 is : 143
2018-07-15 18:22:17,588 WARN launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(113)) - Recovered container exited with a non-zero exit code 154
2018-07-15 18:22:19,193 WARNlogaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRemoteLogDir(230)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users.
2018-07-15 17:51:39,597 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-07-15 18:22:17,587 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(98)) - Unable to recover container container_e01_1531677114201_0001_01_000001
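A log excerpt like the one above can be triaged with a few lines of Python. This is only a sketch: the regular expressions are my guess at the line layout based on these five lines (they also tolerate a missing space after the level), and the name triage is made up for the example.

```python
import re
from collections import Counter

# Timestamp, level, logger name, then the rest of the line.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>FATAL|ERROR|WARN|INFO|DEBUG)\s*"
    r"(?P<source>\S+)\s+(?P<rest>.*)$"
)
# Exit codes appear at the end of the message as "is : 143" or "exit code 154".
EXIT_RE = re.compile(r"(?:is\s*:\s*|exit code\s+)(\d+)\s*$", re.IGNORECASE)


def triage(lines, levels=("WARN", "ERROR", "FATAL")):
    """Count log lines per level and collect the container exit codes."""
    counts = Counter()
    exit_codes = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("level") not in levels:
            continue
        counts[match.group("level")] += 1
        code = EXIT_RE.search(match.group("rest"))
        if code:
            exit_codes.append(int(code.group(1)))
    return counts, exit_codes
```

Fed the five lines above, it reports 3 WARN and 2 ERROR entries and the exit codes 143 and 154. To run it on the full log, pass an open file object for the NodeManager log as lines.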
Created 07-16-2018 12:08 AM
Also, I am still getting the same alert in Ambari:
Connection failed to http://temp.tem1.org:8042/ws/v1/node/info (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
url_response = urllib2.urlopen(query, timeout=connection_timeout)
File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)
Created 07-16-2018 05:47 AM
Hi @Laeeq Ahmad.
Okay, so this time you're having issues with another FQDN, right?
The previous alert was complaining about ip-172-31-32-138.us-west-2.compute.internal, and now it's temp.tem1.org.
So let's check whether the NodeManager hosts (set in Ambari) match the output of the following commands:
cat /etc/sysconfig/network
cat /etc/hosts
hostname --fqdn
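The same comparison can be scripted if you have many nodes. Here is a small Python sketch under my own assumptions: hostname_report is a made-up helper, expected_fqdn is whatever name Ambari shows for the host, and the resolver and local_fqdn arguments exist only so the function can be exercised without touching DNS.

```python
import socket


def hostname_report(expected_fqdn, resolver=socket.gethostbyname, local_fqdn=None):
    """Compare this machine's FQDN with the name registered in Ambari."""
    actual = local_fqdn if local_fqdn is not None else socket.getfqdn()
    try:
        address = resolver(expected_fqdn)
    except OSError:
        address = None  # the expected name does not resolve from this host
    return {
        "expected": expected_fqdn,
        "actual": actual,
        "matches": actual.lower() == expected_fqdn.lower(),
        "resolves_to": address,
        # A name that resolves to 127.x.x.x is only reachable locally.
        "loopback": address is not None and address.startswith("127."),
    }
```

Run hostname_report("temp.tem1.org") on the node itself. If "matches" is False or "loopback" is True, the naming is worth fixing before digging further into YARN.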
Now regarding the warn/error msgs:
143 => As far as I know, this error is usually related to memory misconfiguration. Take a look at this link: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_command-line-installation/content/determ... Also, through Ambari it's possible to "set the recommendation" for most of the parameters 🙂
Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. => Try to add the sticky bit to your yarn.nodemanager.remote-app-log-dir.
154 => Perhaps this link explains what's going on here https://hortonworks.com/blog/resilience-of-yarn-applications-across-nodemanager-restarts/
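To make the exit codes and the permission strings above a bit more concrete, here is a short Python sketch. Two assumptions on my side: the names FORCE_KILLED, TERMINATED and LOST are what I recall from Hadoop's ContainerExecutor.ExitCode enum (please verify against the source of your Hadoop version), and both helper functions are invented for this example. 143 is 128 + 15, which lines up with the SIGTERM entry in your log, so the container was terminated from outside rather than crashing on its own.

```python
import signal
import stat

# Assumption: names recalled from Hadoop's ContainerExecutor.ExitCode enum;
# verify them against the source of your Hadoop version.
YARN_EXIT_CODES = {137: "FORCE_KILLED", 143: "TERMINATED", 154: "LOST"}


def describe_exit_code(code):
    """Name a container exit code and, where it applies, the signal behind it."""
    name = YARN_EXIT_CODES.get(code, "UNKNOWN")
    if code in (137, 143):
        # Shell convention: 128 + signal number (9 = SIGKILL, 15 = SIGTERM).
        return "{} ({})".format(name, signal.Signals(code - 128).name)
    return name


def mode_string(mode):
    """Render permission bits the way the NodeManager warning prints them."""
    return stat.filemode(stat.S_IFDIR | mode)[1:]  # strip the leading 'd'
```

Here describe_exit_code(143) gives "TERMINATED (SIGTERM)", mode_string(0o1777) gives "rwxrwxrwt" (what the NodeManager expects) and mode_string(0o777) gives "rwxrwxrwx" (what it found). So the sticky bit corresponds to octal mode 1777 on the directory, which I believe can be set with hdfs dfs -chmod 1777 /app-logs.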
PS: whenever the NodeManager crashes, check whether a stale PID file was left behind in /var/run/hadoop-yarn/yarn/.
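A quick way to tell whether a PID file is stale is to read the PID and probe it with signal 0. This is a minimal Python sketch and pid_file_status is a hypothetical helper; pass it the path of the PID file you find in that directory.

```python
import os


def pid_file_status(path):
    """Classify a PID file as 'missing', 'invalid', 'running', or 'stale'."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "invalid"  # empty file or non-numeric content
    if pid <= 0:
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence; nothing is delivered
    except ProcessLookupError:
        return "stale"  # the file points at a process that no longer exists
    except PermissionError:
        return "running"  # the process exists but belongs to another user
    return "running"
```

If it returns "stale", remove the PID file before starting the NodeManager again. Note that "running" only means some process has that PID; it does not prove the process is the NodeManager.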
Hope this helps!
Created 07-17-2018 02:44 AM
No, actually I have created a new cluster; that's why you are seeing two different hostnames.
Below are the hosts file entries:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.162.96.45 temp.tem1.org
[root@jazz1 ~]# cat /etc/sysconfig/network
# Created by cloud-init on instance boot automatically, do not edit.
#
NETWORKING=yes
hostname=temp.tem1.org
[root@jazz1 ~]# hostname --fqdn
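For reference, hosts entries like the ones above can be checked with a few lines of Python. It is only a sketch: parse_hosts is a made-up helper that keeps the first address seen for each name, which is my simplification of how lookups against the file usually behave.

```python
def parse_hosts(text):
    """Map every hostname and alias in hosts-file text to its IP address."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first address listed for a name.
            mapping.setdefault(name.lower(), address)
    return mapping
```

With the three entries above, parse_hosts(open("/etc/hosts").read())["temp.tem1.org"] gives "10.162.96.45", so the FQDN maps to a non-loopback address.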
That is all the requested information. However, I am still getting the errors below, and the NodeManager keeps going down:
NodeManager Web UI
Connection failed to http://temp.tem1.org:8042 (<urlopen error [Errno 111] Connection refused>)
NodeManager Health
Connection failed to http://temp.tem1.org:8042/ws/v1/node/info (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
url_response = urllib2.urlopen(query, timeout=connection_timeout)
File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)