NodeManager seen down from Ambari

Hi,

Last week, we had a problem with our Oozie jobs:

0001800-160525142639587-oozie-oozi-W is KILLED
0001808-160525142639587-oozie-oozi-W is FAILED

....

Looking at Ambari, we saw that one of the NodeManagers was reported as down, even though the NodeManager process on that node was running without any problem!

We had to restart the NodeManager process on that node to resume the jobs.

Note that we found no relevant logs on the Ambari server, the ResourceManager, or the NodeManager at the time of the problem!

Any ideas, please?

Another question: if a NodeManager is down, shouldn't Hadoop rerun the failed job on another node?

Thx

HDP version: 2.3

Ambari version: 2.1

Re: NodeManager seen down from Ambari

@Ahmed ELJAMI

For question 1: Ambari may have been showing a stale status. It reported the NodeManager as down while the process was actually running fine when you checked on the node, which would explain why you found nothing in the logs to account for the NM being shown as down. You can take a look at ambari-server.log and ambari-alerts.log and check whether there is any notification for the NodeManager.
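As a rough sketch of that check, the Python 3 snippet below filters alert log lines for NodeManager entries whose state is not OK. It assumes each alert starts with a line shaped like `timestamp [STATE] [SERVICE] [alert_name] (Label) message`. The names `nodemanager_alerts` and `scan_alert_log` are made up for illustration, and the log path passed to `scan_alert_log` (often /var/log/ambari-server/ambari-alerts.log) is an assumption that may not match your install.

```python
import re

# Assumed layout: "<date> <time>,<ms> [STATE] [SERVICE] [alert_name] (Label) message"
ALERT_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<state>\w+)\] \[(?P<service>\w+)\] \[(?P<name>\w+)\] "
    r"\((?P<label>[^)]*)\) (?P<message>.*)$"
)


def nodemanager_alerts(lines):
    """Yield (timestamp, state, alert_name, message) for non-OK NodeManager alerts."""
    for line in lines:
        match = ALERT_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # continuation lines such as traceback frames are skipped
        if "nodemanager" in match["name"].lower() and match["state"] != "OK":
            yield (match["timestamp"], match["state"],
                   match["name"], match["message"])


def scan_alert_log(path):
    """Print every non-OK NodeManager alert found in the file at `path`."""
    with open(path, errors="replace") as log:
        for alert in nodemanager_alerts(log):
            print(*alert)
```

Calling `scan_alert_log` with the path of your alert log should list the times at which Ambari considered the NodeManager unreachable or unhealthy, which you can then compare with the time the Oozie workflows failed.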

For question 2: If a NodeManager is down, shouldn't Hadoop rerun the failed job on another node?

--> Hadoop never reruns a failed job on another node.

Re: NodeManager seen down from Ambari

@Sagar Shimpi

Errors in ambari-server.log:

2016-06-08 04:28:33,369 [CRITICAL] [YARN] [yarn_nodemanager_webui] (NodeManager Web UI) Connection failed to http://node8.mapreduce:8042 (timed out)
2016-06-08 04:28:33,371 [CRITICAL] [YARN] [yarn_nodemanager_health] (NodeManager Health) Connection failed to http://node8.mapreduce:8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 165, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1051, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 415, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
timeout: timed out
)
2016-06-08 04:29:23,922 [OK] [YARN] [yarn_nodemanager_webui] (NodeManager Web UI) HTTP 200 response in 0.002s
2016-06-08 04:29:23,923 [OK] [YARN] [yarn_nodemanager_health] (NodeManager Health) NodeManager Healthy
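
Both CRITICAL entries are HTTP timeouts against port 8042, and both alerts went back to OK about 50 seconds later. That pattern would fit a NodeManager process that was alive but slow to answer. To repeat the same kind of probe by hand, here is a minimal Python 3 sketch (the alert script in the traceback uses Python 2's urllib2). The function names `probe_nodemanager` and `interpret_node_info` are illustrative, and reading the `nodeHealthy` and `healthReport` fields from the `nodeInfo` object reflects my understanding of the NodeManager REST response, so please verify it against your Hadoop version.

```python
import json
import socket
import urllib.error
import urllib.request


def interpret_node_info(payload):
    """Map a /ws/v1/node/info JSON document to a short status string."""
    info = payload.get("nodeInfo", {})
    if info.get("nodeHealthy") is True:
        return "healthy"
    return "unhealthy: " + str(info.get("healthReport", "no report"))


def probe_nodemanager(base_url, timeout=5.0):
    """Query the NodeManager web service once and return a status string.

    A timeout only means the HTTP endpoint did not answer in time;
    the NodeManager process itself may still be running.
    """
    url = base_url.rstrip("/") + "/ws/v1/node/info"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return interpret_node_info(json.load(response))
    except socket.timeout:
        return "timed out"
    except urllib.error.URLError as error:
        if isinstance(error.reason, socket.timeout):
            return "timed out"
        return "unreachable: " + str(error.reason)
    except ValueError:
        return "unexpected response (not JSON)"
```

For example, `probe_nodemanager("http://node8.mapreduce:8042")` run from the Ambari agent host while the alert is firing would show whether the endpoint is merely slow, since a larger `timeout` would then succeed, or not reachable at all.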
