All NodeManagers are down in my Hadoop cluster built with Ambari

New Contributor

Hi. Using Ambari, I set up a Hadoop cluster on three EC2 m4.2xlarge instances. It worked fine for a few months, but I later found that the NodeManagers keep getting shut down.

I have tried a few of the solutions mentioned here, such as deleting /var/log/hadoop-yarn/nodemanager/recovery-state/ and trying to restart, but no luck.

On all nodes I found that CPU usage had climbed above 600%:

top - 08:31:21 up 3:35, 1 user, load average: 10.42, 10.60, 10.81
Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.9 us, 0.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32780760 total, 14747292 free, 15611484 used, 2421984 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 16513908 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20737 yarn      20   0  314724  22880    392 S 780.1  0.1   1401:35 java
 1576 root      20   0 8040888 1.226g  13420 S  16.2  3.9  23:19.59 java
 5189 hdfs      20   0 2946168 408120  24552 S   0.7  1.2   0:35.23 java

[root@ip-172-31-37-158 ~]# ps -ef | grep 20737
yarn 20737 1 99 05:26 ? 23:24:42 /var/tmp/java -c /var/tmp/w.conf

I am not sure what the process "/var/tmp/java -c /var/tmp/w.conf" is or what it belongs to.
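To learn more about an unfamiliar process like this one, a minimal Python 3 sketch along the following lines may help. It assumes a Linux /proc filesystem and enough privileges (typically root) to read another user's process entry. The function name describe_process is made up for illustration and is not part of Hadoop or Ambari.

```python
import os


def describe_process(pid):
    """Collect basic facts about a running process from /proc (Linux only)."""
    base = f"/proc/{pid}"
    info = {"pid": pid}

    # /proc/<pid>/exe and /proc/<pid>/cwd are symlinks to the real binary
    # and the working directory; reading them needs sufficient privileges.
    for key in ("exe", "cwd"):
        try:
            info[key] = os.readlink(f"{base}/{key}")
        except OSError as err:
            info[key] = f"<unreadable: {err}>"

    # /proc/<pid>/cmdline holds the arguments separated by NUL bytes.
    try:
        with open(f"{base}/cmdline", "rb") as handle:
            raw = handle.read()
        info["cmdline"] = [
            part.decode(errors="replace") for part in raw.split(b"\0") if part
        ]
    except OSError:
        info["cmdline"] = []

    return info
```

Calling describe_process(20737) on an affected node would show which file the "java" command really resolves to and where it was started from, which helps tell it apart from the JVMs that Ambari launches for the Hadoop services.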

In the NodeManager logs I can see the following error messages:

/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-ip-172-31-37-158.us-east-2.compute.internal.log.1:2018-06-06 03:12:24,348 ERROR filecontroller.LogAggregationFileController (LogAggregationFileController.java:run(360)) - Failed to setup application log directory for application_1528245572429_0439
/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-ip-172-31-37-158.us-east-2.compute.internal.log.1:2018-06-07 02:53:24,909 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
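The second line records the NodeManager receiving SIGTERM, which suggests it was stopped by a signal from outside rather than exiting on its own. To pull such entries out of a larger log, a rough Python 3 sketch could look like this. The line layout in the regular expression is an assumption inferred only from the two lines quoted above, and the names parse_errors and was_signalled are hypothetical.

```python
import re

# Assumed layout, inferred from the two quoted lines:
# "<timestamp> <LEVEL> <logger> (<File.java>:<method>(<line>)) - <message>"
LOG_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) +"
    r"(?P<logger>\S+) "
    r"\((?P<source>\S+)\) - "
    r"(?P<message>.*)"
)


def parse_errors(lines):
    """Return one dict per ERROR or FATAL line; other lines are skipped."""
    found = []
    for line in lines:
        # search() rather than match() so a grep-style "file:" prefix is ignored.
        match = LOG_LINE.search(line)
        if match and match.group("level") in ("ERROR", "FATAL"):
            found.append(match.groupdict())
    return found


def was_signalled(entries):
    """True if any parsed entry records the daemon receiving a signal."""
    return any("RECEIVED SIGNAL" in entry["message"] for entry in entries)
```

Feeding it the lines of the NodeManager log file (for example via open(path)) gives a list of timestamps and messages that can be lined up against the time the suspicious process started.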

In the Ambari UI I can see the following alerts on all machines:

Connection failed to http://ip-172-31-37-158.us-east-2.compute.internal:8042 (<urlopen error [Errno 111] Connection refused>)
Connection failed to http://ip-172-31-37-158.us-east-2.compute.internal:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)
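These alerts come from a plain HTTP request to the NodeManager web port, and "Connection refused" means nothing was listening on 8042 when the check ran. The same check can be repeated by hand with a small Python 3 sketch like the one below. It uses urllib.request instead of the Python 2 urllib2 seen in the traceback, and probe_nodemanager is a hypothetical helper name, not the Ambari alert script itself.

```python
import json
import urllib.error
import urllib.request


def probe_nodemanager(base_url, timeout=5.0):
    """Return (reachable, detail) for the /ws/v1/node/info endpoint."""
    url = base_url.rstrip("/") + "/ws/v1/node/info"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except urllib.error.URLError as err:
        # Covers refused connections, DNS failures and HTTP error statuses.
        return False, f"connection failed: {err.reason}"
    except (OSError, ValueError) as err:
        # Covers read timeouts and replies that are not valid JSON.
        return False, f"unexpected response: {err}"
    return True, body
```

For example, probe_nodemanager("http://ip-172-31-37-158.us-east-2.compute.internal:8042") should return False with a "Connection refused" detail while the NodeManager is down, and the decoded JSON reply once it is running again.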
1 REPLY

New Contributor

@Geoffrey Shelton Okot

Any idea on this problem?