Created on 01-29-2016 06:59 AM - edited 09-16-2022 03:00 AM
Hi all,
I have successfully installed Cloudera Manager 5.5.1 on a private cluster with only HDFS, YARN and Spark.
I keep getting Health Issues every 10 - 15 minutes reporting "Web Server Status : The Cloudera Manager Agent got an unexpected response from this role's web server."
The corresponding entry in the host's Cloudera agent log is the following:
[29/Jan/2016 16:51:32 +0000] 1237 Monitor-HostMonitor throttling_logger ERROR (30 skipped) Failed to collect NTP metrics
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 37, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 30, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib64/cmf/agent/src/cmf/subprocess_timeout.py", line 49, in subprocess_with_timeout
    p = subprocess.Popen(**kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
And another one
[29/Jan/2016 16:48:32 +0000] 1237 Monitor-GenericMonitor throttling_logger ERROR (1 skipped) Error fetching metrics at 'http://host-hd-01.corp.nodalpoint.com:8086/jmx'
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/generic/metric_collectors.py", line 165, in collect_and_parse
    simplejson.load(opened_url))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/simplejson-2.1.2-py2.7-linux-x86_64.egg/simplejson/__init__.py", line 324, in load
    return loads(fp.read(),
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
error: [Errno 9] Bad file descriptor
Has anyone else noticed similar issues?
Thank you
Created 01-30-2016 10:06 AM
The first issue has to do with NTP [1]. The second occurs when the Agent attempts to read the JSON contents (possibly Service Monitor metrics) and encounters an error, hence the
Error fetching metrics at 'http://host-hd-01.corp.nodalpoint.com:8086/jmx'
Do you know which role the health check is reporting for? For reference, can you attach a screenshot?
[1] http://www.cloudera.com/documentation/enterprise/latest/topics/install_cdh_enable_ntp.html
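The `OSError: [Errno 2] No such file or directory` raised from `subprocess.Popen` in the first traceback is what Python typically reports when the executable it was asked to launch cannot be found. The log does not show which command the agent tried to run, so the tool names below (`ntpq`, `ntpdc`, `chronyc`) are assumptions, and `find_tools` and `try_run` are hypothetical helper names. A minimal Python 3 sketch for checking a host by hand might look like this:

```python
import shutil
import subprocess

# Candidate time-sync query tools. Which one the agent actually calls is an
# assumption: the traceback does not include the command line.
CANDIDATE_TOOLS = ["ntpq", "ntpdc", "chronyc"]


def find_tools(names=CANDIDATE_TOOLS):
    """Map each tool name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


def try_run(args, timeout=5):
    """Run a command and return (ok, detail) instead of raising."""
    try:
        proc = subprocess.run(
            args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout
        )
    except OSError as exc:
        # errno 2 (ENOENT) is the "No such file or directory" seen in the log.
        return False, "errno %s: %s" % (exc.errno, exc.strerror)
    except subprocess.TimeoutExpired:
        return False, "timed out after %s seconds" % timeout
    return proc.returncode == 0, proc.stdout.decode(errors="replace")


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(name, "->", path or "not found on PATH")
```

If every candidate shows up as "not found on PATH" on a host, a missing binary is a plausible explanation for the Errno 2 there, though this alone does not confirm what the agent invokes.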
Created 02-01-2016 01:06 AM
Thank you Michalis for your quick response.
Regarding the first issue, you mention that it is related to NTP.
I use RHEL 7.1 as the operating system, which uses the chrony service by default instead of NTP.
Do you recommend replacing the chrony service with the ntp service?
Regarding the second issue, I am providing screenshots from three different services where this issue occurs:
a) from the Host Monitor
[01/Feb/2016 10:44:34 +0000] 1237 Monitor-GenericMonitor throttling_logger ERROR (8 skipped) Error fetching metrics at 'http://host-hd-01.corp.nodalpoint.com:8086/jmx'
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/generic/metric_collectors.py", line 165, in collect_and_parse
    simplejson.load(opened_url))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/simplejson-2.1.2-py2.7-linux-x86_64.egg/simplejson/__init__.py", line 324, in load
    return loads(fp.read(),
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
error: [Errno 9] Bad file descriptor
with its corresponding screenshot
b) from a YARN NodeManager
[01/Feb/2016 10:37:19 +0000] 1363 Monitor-GenericMonitor throttling_logger ERROR (6 skipped) Error fetching metrics at 'http://host-hd-03.corp.nodalpoint.com:8042/jmx'
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/generic/metric_collectors.py", line 165, in collect_and_parse
    simplejson.load(opened_url))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/simplejson-2.1.2-py2.7-linux-x86_64.egg/simplejson/__init__.py", line 324, in load
    return loads(fp.read(),
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
error: [Errno 9] Bad file descriptor
with its corresponding screenshot
c) and from the NameNode
[01/Feb/2016 10:53:34 +0000] 1237 Monitor-GenericMonitor throttling_logger ERROR (1 skipped) Error fetching metrics at 'http://host-hd-01.corp.nodalpoint.com:8087/jmx'
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/generic/metric_collectors.py", line 165, in collect_and_parse
    simplejson.load(opened_url))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/simplejson-2.1.2-py2.7-linux-x86_64.egg/simplejson/__init__.py", line 324, in load
    return loads(fp.read(),
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
error: [Errno 9] Bad file descriptor
with its corresponding screenshot
Please tell me if you require any more information
Thanks again for your support
Filaretos
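For anyone wanting to check by hand whether one of the `/jmx` URLs from the log lines above returns parseable JSON, here is a minimal Python 3 sketch. It is not what the agent does internally (the agent in the tracebacks runs on Python 2.7 with simplejson); `fetch_jmx`, `parse_jmx` and `count_beans` are hypothetical helper names, and the sketch assumes the endpoint is reachable without authentication and wraps its output in a top-level "beans" list, as Hadoop's JMX JSON servlet does.

```python
import json
import urllib.request


def parse_jmx(body):
    """Decode a raw /jmx response body (bytes) into a Python dict."""
    return json.loads(body.decode("utf-8"))


def count_beans(doc):
    """Count the entries in the top-level "beans" list, 0 if it is absent."""
    return len(doc.get("beans", []))


def fetch_jmx(url, timeout=10):
    """Fetch a /jmx URL and return the parsed JSON.

    Raises urllib.error.URLError on connection problems and ValueError
    if the body is not valid JSON.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_jmx(resp.read())
```

For example, `count_beans(fetch_jmx("http://host-hd-03.corp.nodalpoint.com:8042/jmx"))` run from a cluster host would show whether the NodeManager endpoint answers with well-formed JSON at that moment.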
Created 02-01-2016 05:12 AM
The issue is resolved.
I replaced the chrony service with the NTP service on all my hosts, according to Michalis's recommendation, and all errors stopped.
Not only did the errors that explicitly stated "Failed to collect NTP metrics" stop, but all the other errors did as well. Apparently all these errors were somehow related to the inability to collect NTP metrics.
Thank you!