
All Node managers are down in my hadoop cluster built by Ambari



Hi, using Ambari I set up a Hadoop cluster on 3 EC2 m4.2xlarge instances. It worked fine for a few months, but I later found that the NodeManagers keep getting shut down.

I have tried a few of the solutions mentioned in this forum, such as deleting /var/log/hadoop-yarn/nodemanager/recovery-state/ and restarting the NodeManagers, but no luck.

On all nodes I found that CPU usage had climbed above 600%:

top - 08:31:21 up 3:35, 1 user, load average: 10.42, 10.60, 10.81
Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.9 us, 0.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32780760 total, 14747292 free, 15611484 used, 2421984 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 16513908 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20737 yarn 20 0 314724 22880 392 S 780.1 0.1 1401:35 java
1576 root 20 0 8040888 1.226g 13420 S 16.2 3.9 23:19.59 java
5189 hdfs 20 0 2946168 408120 24552 S 0.7 1.2 0:35.23 java

[root@ip-172-31-37-158 ~]# ps -ef | grep 20737
yarn 20737 1 99 05:26 ? 23:24:42 /var/tmp/java -c /var/tmp/w.conf

I am not sure what the process "/var/tmp/java -c /var/tmp/w.conf" is or where it came from.
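A binary named java running out of /var/tmp is not where a JDK is normally installed, so it may be worth checking each node for other processes started from world-writable directories. The following is only a minimal sketch of that check: the directory list and the function names are my own assumptions, and it relies on the Linux /proc layout (run it as root to see every process).

```python
import os

# Assumption: directories that are typically world-writable on Linux.
TEMP_DIRS = ("/tmp", "/var/tmp", "/dev/shm")


def exe_in_temp_dir(exe_path, temp_dirs=TEMP_DIRS):
    """Return True if exe_path is located inside one of temp_dirs."""
    path = os.path.normpath(exe_path)
    return any(path.startswith(d + "/") for d in temp_dirs)


def find_temp_dir_processes(proc_root="/proc"):
    """Yield (pid, exe_path) for processes whose binary lives in a temp dir."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            exe = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            continue  # process exited, kernel thread, or permission denied
        # The kernel appends " (deleted)" when the binary was removed from disk.
        if exe.endswith(" (deleted)"):
            exe = exe[: -len(" (deleted)")]
        if exe_in_temp_dir(exe):
            yield int(entry), exe


if __name__ == "__main__":
    for pid, exe in find_temp_dir_processes():
        print(pid, exe)
```

For the process above, exe_in_temp_dir("/var/tmp/java") is true, while a JDK path such as /usr/lib/jvm/... would not match.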

In the NodeManager logs I could see the following error messages:

/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-ip-172-31-37-158.us-east-2.compute.internal.log.1:2018-06-06 03:12:24,348 ERROR filecontroller.LogAggregationFileController (LogAggregationFileController.java:run(360)) - Failed to setup application log directory for application_1528245572429_0439
/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-ip-172-31-37-158.us-east-2.compute.internal.log.1:2018-06-07 02:53:24,909 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM

In the Ambari UI, I could see the following alerts on all machines:

Connection failed to http://ip-172-31-37-158.us-east-2.compute.internal:8042 (<urlopen error [Errno 111] Connection refused>)
Connection failed to http://ip-172-31-37-158.us-east-2.compute.internal:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused>
)
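Errno 111 (connection refused) means nothing is listening on port 8042, which fits the NodeManager process having been stopped by the SIGTERM in the log. A small sketch for confirming this on each node without going through Ambari follows; the function name and the timeout are assumptions of mine, and it only tests whether the TCP port accepts connections, not whether the NodeManager is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hostname taken from the alert above; 8042 is the port shown in that alert.
    host = "ip-172-31-37-158.us-east-2.compute.internal"
    print(host, "8042 open:", port_is_open(host, 8042))
```

If this prints False while the NodeManager is supposed to be running, the alert is accurate and the process itself is down rather than the Ambari agent misreporting.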

Re: All Node managers are down in my hadoop cluster built by Ambari


@Geoffrey Shelton Okot

Any idea on this problem?
