Member since
10-09-2017
07-16-2018
04:38 AM
@Geoffrey Shelton Okot Any idea on this problem?
07-15-2018
09:23 AM
Hi,

Using Ambari, I set up a Hadoop cluster on 3 EC2 m4.2xlarge instances. It worked fine for a few months, but later I found that the NodeManagers keep getting shut down. I have tried a few of the solutions mentioned here, such as deleting /var/log/hadoop-yarn/nodemanager/recovery-state/ and trying to restart again, but no luck.

On all nodes I found that CPU usage went up to 600%:

top - 08:31:21 up 3:35, 1 user, load average: 10.42, 10.60, 10.81
Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.9 us, 0.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32780760 total, 14747292 free, 15611484 used, 2421984 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 16513908 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20737 yarn 20 0 314724 22880 392 S 780.1 0.1 1401:35 java
1576 root 20 0 8040888 1.226g 13420 S 16.2 3.9 23:19.59 java
5189 hdfs 20 0 2946168 408120 24552 S 0.7 1.2 0:35.23 java

[root@ip-172-31-37-158 ~]# ps -ef | grep 20737
yarn 20737 1 99 05:26 ? 23:24:42 /var/tmp/java -c /var/tmp/w.conf

I am not sure what the process "/var/tmp/java -c /var/tmp/w.conf" is or where it comes from.

In the NodeManager logs I can see the error messages below:

/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-ip-172-31-37-158.us-east-2.compute.internal.log.1:2018-06-06 03:12:24,348 ERROR filecontroller.LogAggregationFileController (LogAggregationFileController.java:run(360)) - Failed to setup application log directory for application_1528245572429_0439
/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-ip-172-31-37-158.us-east-2.compute.internal.log.1:2018-06-07 02:53:24,909 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM

In the Ambari UI, I can see the alerts below on all machines:

Connection failed to http://ip-172-31-37-158.us-east-2.compute.internal:8042 (<urlopen error [Errno 111] Connection refused>)

Connection failed to http://ip-172-31-37-158.us-east-2.compute.internal:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 111] Connection refused> )
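For anyone trying to reproduce the two checks in this post outside of Ambari, here is a rough Python 3 sketch. It is not the Ambari alert script itself: the function names (`node_info_url`, `probe_nodemanager`, `parse_ps_line`) and the `opener` parameter are made up for illustration, and only the port 8042 and the `/ws/v1/node/info` path are taken from the alert text above. The first part queries the same URL the alert queries and reports "Connection refused" instead of raising; the second part splits a `ps -ef` row into its columns so the owning user and the full command are easy to read off.

```python
import json
import urllib.error
import urllib.request


def node_info_url(host, port=8042):
    """Build the NodeManager URL that appears in the Ambari alert above."""
    return "http://{}:{}/ws/v1/node/info".format(host, port)


def probe_nodemanager(host, port=8042, timeout=5.0,
                      opener=urllib.request.urlopen):
    """Return (reachable, detail) for the NodeManager web endpoint.

    `opener` is a hypothetical injection point so the function can be
    tried without a live cluster; by default it is urllib's urlopen.
    """
    url = node_info_url(host, port)
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8")
    except urllib.error.URLError as err:
        # Includes "[Errno 111] Connection refused" when nothing listens.
        return False, "connection failed: {}".format(err.reason)
    except OSError as err:
        # Socket-level problems such as a read timeout.
        return False, "connection failed: {}".format(err)
    try:
        return True, json.loads(body)
    except ValueError:
        return True, "reachable, but the response was not JSON"


def parse_ps_line(line):
    """Split one `ps -ef` row into UID PID PPID C STIME TTY TIME CMD."""
    fields = line.split(None, 7)  # CMD keeps its own spaces
    if len(fields) < 8:
        raise ValueError("not a full `ps -ef` row: {!r}".format(line))
    keys = ("uid", "pid", "ppid", "c", "stime", "tty", "time", "cmd")
    return dict(zip(keys, fields))
```

Calling `probe_nodemanager("ip-172-31-37-158.us-east-2.compute.internal")` from a machine that can reach the node should tell you whether the NodeManager web port is answering at all, and `parse_ps_line(...)` on the `ps` row above gives the `yarn` user and the `/var/tmp/java -c /var/tmp/w.conf` command as separate fields.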
Labels:
- Apache Hadoop