
After deploying new cluster NodeManager crashes after a few minutes - NodeManager WebUI gets [Errno 111] Connection refused


This is a brand-new cluster I deployed on AWS with 3 nodes. Deployment reported no errors, but after a few minutes the NodeManager on one of the nodes crashes.

The error from the NodeManager Web UI alert is:

Connection failed to http://FQDN:8042 (<urlopen error [Errno 111] Connection refused>)

If I run netstat -nltp | grep 8042 on that node, nothing is listening on that port.
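For anyone who wants to repeat that port check from another machine (for example, to rule out an AWS security group issue), here is a minimal Python 3 sketch. The function name `port_is_open` is made up for illustration, and FQDN is the same placeholder used above; it only tests whether a TCP connection succeeds, nothing YARN-specific.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration; it is not part of Ambari or YARN.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and a non-zero errno otherwise,
        # e.g. 111 (ECONNREFUSED) on Linux when nothing is listening.
        return sock.connect_ex((host, port)) == 0


# Example call (replace FQDN with the real NodeManager host name):
#     print(port_is_open("FQDN", 8042))
```

If this returns False on the node itself as well as remotely, the process is simply not listening, which matches the empty netstat output.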

The error from the NodeManager Health alert is:

Connection failed to http://FQDN:8042/ws/v1/node/info (Traceback (most recent call last): File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute url_response = urllib2.urlopen(query, timeout=connection_timeout) File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen ...
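The traceback shows the alert script doing nothing more than an HTTP GET on /ws/v1/node/info, so the same check can be reproduced by hand. Below is a rough Python 3 sketch, not the Ambari script itself. The function names are hypothetical, and the `nodeInfo`, `nodeHealthy`, and `healthReport` fields are what I understand the NodeManager REST API to return; treat them as an assumption to verify against your Hadoop version.

```python
import json
import urllib.error
import urllib.request


def summarize_node_info(payload):
    """Map a parsed /ws/v1/node/info response to (status, message).

    Assumes the payload looks like {"nodeInfo": {"nodeHealthy": ..., ...}}.
    """
    info = payload.get("nodeInfo", {})
    if info.get("nodeHealthy"):
        return "OK", "NodeManager reports healthy"
    return "CRITICAL", info.get("healthReport") or "NodeManager reports unhealthy"


def check_nodemanager(base_url, timeout=5.0):
    """Query the NodeManager web service and classify the result."""
    url = base_url.rstrip("/") + "/ws/v1/node/info"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except urllib.error.URLError as exc:
        # Includes "[Errno 111] Connection refused": no process on that port.
        return "CRITICAL", "Connection failed to {}: {}".format(url, exc.reason)
    return summarize_node_info(payload)


# Example call (FQDN is a placeholder):
#     print(check_nodemanager("http://FQDN:8042"))
```

A "Connection failed" result here means the NodeManager process is down or not bound to the port, as opposed to being up but reporting itself unhealthy.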

I've scoured all the previous posts that looked even slightly related and had no luck figuring out what is wrong. I've looked at some of the YARN and MapReduce allocation settings, but these don't look unreasonably small:

yarn.scheduler.maximum-allocation-mb 24576

I tried increasing the following:

  • yarn.app.mapreduce.am.resource.mb from 8192 to 16384
  • mapreduce.task.io.sort.mb from 2047 to 4096

Also increased these:

  • Changed: YARN_RESOURCEMANAGER_HEAPSIZE from 1024 to 2048
  • Changed: YARN_NODEMANAGER_HEAPSIZE from 1024 to 2048
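To sanity-check the memory values listed above against each other, here is a small Python sketch. It rests on two assumptions that should be verified for your Hadoop version: that an ApplicationMaster request larger than yarn.scheduler.maximum-allocation-mb cannot be granted, and that the map-side sort buffer rejects mapreduce.task.io.sort.mb values above 2047. The function name `check_settings` is hypothetical.

```python
# Assumed upper bound for mapreduce.task.io.sort.mb (0x7FF == 2047 MB).
MAX_IO_SORT_MB = 0x7FF


def check_settings(settings):
    """Return a list of human-readable problems found in the given settings."""
    problems = []
    am_mb = settings["yarn.app.mapreduce.am.resource.mb"]
    max_mb = settings["yarn.scheduler.maximum-allocation-mb"]
    sort_mb = settings["mapreduce.task.io.sort.mb"]
    if am_mb > max_mb:
        problems.append(
            "AM container ({} MB) exceeds the scheduler maximum ({} MB)".format(am_mb, max_mb)
        )
    if sort_mb > MAX_IO_SORT_MB:
        problems.append(
            "mapreduce.task.io.sort.mb ({}) exceeds {}".format(sort_mb, MAX_IO_SORT_MB)
        )
    return problems


# Values from this post after the changes:
after_changes = {
    "yarn.scheduler.maximum-allocation-mb": 24576,
    "yarn.app.mapreduce.am.resource.mb": 16384,
    "mapreduce.task.io.sort.mb": 4096,
}
print(check_settings(after_changes))
```

Under those assumptions, the raised mapreduce.task.io.sort.mb value is the only one flagged; it would affect map tasks rather than NodeManager startup, so it may be a separate issue from the crash.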

How do I troubleshoot this? When I restart the NodeManager service and monitor the ambari-agent log, I see the following ERROR as it tries to start up:

ERROR 2018-11-08 00:48:23,969 script_alert.py:123 - [Alert][yarn_nodemanager_health] Failed with result CRITICAL: ['Connection failed to http://FQDN:8042/ws/v1/node/info (Traceback (most recent call last):\nFile "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 171, in execute\nurl_response = urllib2.urlopen(query, timeout=connection_timeout)\nFile "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen\nreturn opener.open(url, data, timeout)\nFile "/usr/lib64/python2.7/urllib2.py", line 431, in open\nresponse = self._open(req, data)\nFile "/usr/lib64/python2.7/urllib2.py", line 449, in _open\n\'_open\', req)\nFile "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain\nresult = func(*args)\nFile "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open\nreturn self.do_open(httplib.HTTPConnection, req)\nFile "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open\nraise URLError(err)\nURLError: <urlopen error [Errno 111] Connection refused>\n)']
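The ambari-agent entry above only records that the alert could not connect; the reason the process exits would normally be in the NodeManager's own log. A minimal Python sketch for pulling the most recent FATAL/ERROR lines from such a log follows. The function name is hypothetical, the log location is not given in this post and has to be supplied by the reader, and the match assumes the usual log4j layout where the level appears as a space-delimited word.

```python
def recent_problems(lines, levels=("FATAL", "ERROR"), limit=20):
    """Return the last `limit` lines that contain one of the given log levels.

    Assumes each level appears surrounded by spaces, e.g.
    "2018-11-08 00:48:23,969 ERROR some.Class - message".
    """
    hits = [
        line.rstrip("\n")
        for line in lines
        if any(" {} ".format(level) in line for level in levels)
    ]
    return hits[-limit:]


# Example usage (the path is a placeholder; use your NodeManager log file):
#     with open("/path/to/nodemanager.log") as log_file:
#         for entry in recent_problems(log_file):
#             print(entry)
```

Reading the file line by line keeps memory use modest even for large logs, since only the matching lines are retained.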

HELP - I've been working on this all day. I'm stuck. I'm hoping it's something stupid that I'm missing but I'm new to installing Hadoop so I'm not sure how to troubleshoot this.
