The Cloudera Manager Agent is not able to communicate with this role's web server.

New Contributor

Hi 

 

I am getting the error "The Cloudera Manager Agent is not able to communicate with this role's web server." in the HDFS, HBase, and YARN services. The rest of the services are not reporting any errors.

 

I am able to execute code in all services, but I still get the above error message.

 

The following are the logs from "cloudera-scm-agent.log":

 

ERROR Failed to collect java-based DNS names
Traceback (most recent call last):
File "/usr/lib/cmf/agent/src/cmf/monitor/host/dns_names.py", line 64, in collect
result, stdout, stderr = self._subprocess_with_timeout(args, self._poll_timeout)
File "/usr/lib/cmf/agent/src/cmf/monitor/host/dns_names.py", line 46, in _subprocess_with_timeout
return subprocess_with_timeout(args, timeout)
File "/usr/lib/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 40, in subprocess_with_timeout
close_fds=True)
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception

 

48487 Monitor-DataNodeMonitor throttling_logger ERROR Error fetching metrics at 'http://hdslave01.cibil.com:50075/jmx'
Traceback (most recent call last):
File "/usr/lib/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 399, in collect_metrics_from_url
openedUrl = self.urlopen(url, username=username, password=password)
File "/usr/lib/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 363, in urlopen
password=password)
File "/usr/lib/cmf/agent/src/cmf/url_util.py", line 58, in urlopen_with_timeout
return opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 503: Service Unavailable

 

Correct me if I am wrong: are these errors related to the HTTPS server?

I didn't get these errors in CDH 4.5; I started getting them after a fresh installation of CDH 5.2.

Do the HBase, HDFS, and YARN services require an HTTPS server or TLS configuration? If so, how do I fix the configuration?

 

Thanks in advance   

10 REPLIES

Explorer

Hi,

Were you able to find the cause of your problem? We are experiencing the same problem.

Master Guru

@galzoran,

 

That original post was from years ago, so let's get your information so we can make sure we are troubleshooting the same thing.  The stack traces from your issue will be more useful than the old ones.

 

The agent will periodically make an HTTP request of roles running on the same host as the agent to load JMX output and supply that to Service Monitor for metrics collection.  If that JMX loading fails, you can see events listed in Cloudera Manager indicating as much.
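
To make that request concrete, here is a minimal Python sketch of the same idea. It is not the agent's actual code. It assumes the role's /jmx endpoint returns the usual Hadoop JSON document with a top-level "beans" list, and the names bean_names and fetch_jmx are invented for this example.

```python
import json
import urllib.request


def bean_names(jmx_json):
    """Return the MBean names found in a Hadoop-style /jmx JSON document."""
    document = json.loads(jmx_json)
    # Assumption: the JMX servlet wraps its output in a top-level "beans" list.
    return [bean.get("name", "<unnamed>") for bean in document.get("beans", [])]


def fetch_jmx(url, timeout=10):
    """Fetch the raw JSON text from a role's /jmx endpoint (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling bean_names(fetch_jmx(url)) with the URL from the agent log should list the beans if the web server is healthy. An HTTP 503 or a TLS failure here would surface as an exception, much like the ones in the agent log.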

 

The best thing you can do to start off is to get the stack traces that occur when the agent fails to access the JMX information in the web resource.  This information will be in the agent logs on that host (/var/log/cloudera-scm-agent/cloudera-scm-agent.log by default).
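
If the log is large, a small script can pull out just the relevant failures. The following is a rough Python sketch, assuming the log lines look like the ones posted in this thread (an "Error fetching metrics at '<url>'" line followed by a traceback that ends in an exception line). The function name summarize_fetch_errors is made up for illustration.

```python
import re

FETCH_ERROR = re.compile(r"Error fetching metrics at '([^']+)'")
# Final traceback line, e.g. "HTTPError: ..." or "SSLError: ..."
EXCEPTION_LINE = re.compile(r"[A-Za-z_][\w.]*(?:Error|Exception):")


def summarize_fetch_errors(log_text):
    """Pair each failing metrics URL with the exception that ended its traceback."""
    results = []
    current_url = None
    in_traceback = False
    for line in log_text.splitlines():
        match = FETCH_ERROR.search(line)
        if match:
            # A new failure starts; remember which URL the agent was fetching.
            current_url = match.group(1)
            in_traceback = False
            continue
        if current_url is None:
            continue
        stripped = line.strip()
        if stripped.startswith("Traceback (most recent call last):"):
            in_traceback = True
        elif in_traceback and EXCEPTION_LINE.match(stripped):
            results.append((current_url, stripped))
            current_url = None
            in_traceback = False
    return results
```

Passing the contents of cloudera-scm-agent.log to this function gives a list of (url, exception) pairs, which is usually enough to tell an HTTP error apart from a TLS failure.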

 

If you can show us a few of those, it will give us a good idea of what we can look at next.

 

 

New Contributor

[20/Mar/2020 13:18:15 +0000] 89447 Monitor-GenericMonitor throttling_logger ERROR Error fetching metrics at 'https://NodeManager.example.com:8042/jmx'
Traceback (most recent call last):
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/monitor/generic/metric_collectors.py", line 203, in _collect_and_parse_and_return
self._adapter.safety_valve))
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 241, in urlopen_with_retry_on_authentication_errors
return function()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/monitor/generic/metric_collectors.py", line 220, in _open_url
password=self._password_value)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 82, in urlopen_with_timeout
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 229, in https_open
return self.do_open(HTTPSConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1211, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
File "/usr/lib64/python2.7/httplib.py", line 1056, in request
self._send_request(method, url, body, headers)
File "/usr/lib64/python2.7/httplib.py", line 1090, in _send_request
self.endheaders(body)
File "/usr/lib64/python2.7/httplib.py", line 1052, in endheaders
self._send_output(message_body)
File "/usr/lib64/python2.7/httplib.py", line 890, in _send_output
self.send(msg)
File "/usr/lib64/python2.7/httplib.py", line 852, in send
self.connect()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 224, in connect
self.sock.connect((self.host, self.port))
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 309, in connect
ret = self.connect_ssl()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 295, in connect_ssl
return m2.ssl_connect(self.ssl, self._timeout)
SSLError: sslv3 alert handshake failure
[20/Mar/2020 13:18:15 +0000] 89447 MonitorDaemon-Scheduler daemon INFO Monitor expired: ('GenericMonitor YARN-NODEMANAGER for yarn-NODEMANAGER-f6d6bb549fe50d7a52ac23558edb4509',)
[20/Mar/2020 13:18:19 +0000] 89447 MonitorDaemon-Reporter firehoses INFO Creating a connection to the SERVICEMONITOR.
[20/Mar/2020 13:18:19 +0000] 89447 MonitorDaemon-Reporter firehoses INFO Creating a connection to the HOSTMONITOR.
[20/Mar/2020 13:21:18 +0000] 89447 MonitorDaemon-Reporter throttling_logger INFO Descendants user CPU lower than expected for process 10751: 18630414.15, 18630292.07
[20/Mar/2020 13:27:01 +0000] 89447 MainThread heartbeat_tracker INFO HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.02 max:0.05 LIFE_MAX:0.11

New Contributor

This is from your log.

SSLError: sslv3 alert handshake failure

 

This looks like an SSL certificate issue. Check the keystore and truststore for certificates.

Master Guru

Hi @Haris ,

 

The error and stack trace show us that the agent attempted to connect to:

 

https://NodeManager.example.com:8042/jmx

 

However, the connection failed and the agent threw an exception:

 

File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 295, in connect_ssl
return m2.ssl_connect(self.ssl, self._timeout)
SSLError: sslv3 alert handshake failure

 

In the above, we see that a call was made to ssl_connect but a failure alert was returned.  This indicates that the server (the NodeManager) failed the TLS handshake for some reason.  If this was an issue with something on the client side (agent) then we would expect a more descriptive error on the client side about why the failure occurred.

 

If this is an error on the server side, it is possible the NodeManager log may have some information, but I doubt it.

 

I would recommend testing with curl on https://NodeManager.example.com:8042/jmx:

 

curl -v -k https://NodeManager.example.com:8042/jmx

 

The above should return JMX.  If it doesn't, share the error with us.

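Curl performs the same kind of TLS handshake as the Agent does (both rely on a native TLS library, which for curl may be OpenSSL or NSS depending on the build), so if curl works, the agent should generally work too.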
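
If curl is not available on the host, the same check can be approximated in Python. This is only a sketch using the standard ssl module. It skips certificate verification, as curl's -k flag does, so that only the handshake itself is tested. The function name probe_tls is an invention for this example.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Attempt a TLS handshake and return (succeeded, detail)."""
    context = ssl.create_default_context()
    # Like curl -k: skip certificate verification so that only protocol and
    # cipher negotiation are being tested.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                name, protocol, _bits = tls.cipher()
                return True, "negotiated %s using %s" % (protocol, name)
    except ssl.SSLError as error:
        # The TCP connection worked but the handshake did not.
        return False, "TLS handshake failed: %s" % error
    except OSError as error:
        # Nothing is listening, the host is unreachable, or the attempt timed out.
        return False, "connection failed: %s" % error
```

For example, probe_tls("NodeManager.example.com", 8042) separates "nothing is listening on the port" from "the port is open but the handshake fails", which is the distinction that matters here.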

 

Note:  does the problem happen all the time or sometimes?

 

New Contributor

@bgooley Thanks for taking the time to look at the issue.

As suggested, I ran the curl command; the output is below.

I haven't seen this error on other nodes in the cluster, just this one. I'll check the certs and keystore as well.

 

* About to connect() to nodemanager.example.com port 8042 (#0)
* Trying 10.28.153.47...
* Connected to nodemanager.example.com (10.28.153.47) port 8042 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* NSS error -12286 (SSL_ERROR_NO_CYPHER_OVERLAP)
* Cannot communicate securely with peer: no common encryption algorithm(s).
* Closing connection 0
curl: (35) Cannot communicate securely with peer: no common encryption algorithm(s).

Master Guru

Hi @Haris ,

 

Thanks!  That error shows that the server reported it could find no matching ciphers to allow the TLS handshake to occur.

In the TLS 1.x handshake, the client will send a ClientHello message to the server.  The server will then find the strongest cipher it has that shows up in that client list.  If it cannot find any, it will return the error you mention.
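
As a simplified illustration of that selection step (a Python sketch, not how any particular TLS library implements it), the server can be thought of as walking its own preference list and picking the first suite the client also offered. The function name select_cipher is hypothetical, and real servers may also honor the client's ordering or rule out suites for which they have no usable key.

```python
def select_cipher(server_preference, client_offer):
    """Return the first server-preferred cipher that the client also offered, or None."""
    offered = set(client_offer)
    for cipher in server_preference:
        if cipher in offered:
            return cipher
    # No overlap: the server cannot continue and sends a handshake failure alert.
    return None
```

Note that a server with no usable private key effectively has an empty preference list for certificate-based suites, so the result is None no matter what the client offers.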

 

There are a few reasons this error might happen:

  • There is no private key in the NodeManager Keystore
  • There is a private key entry and a trusted certificate entry that have the same public certificate (this should only impact CDH 6 and higher)
  • The keystore is in PKCS12 format (even though the file is named as if it were JKS)
  • The client and server ciphers really don't overlap (super unlikely unless you have been changing cipher support)

For starters, it would be good to have a look at your server's keystore (that it uses to start and listen via TLS).  To do so you could use Java's keytool:

 

keytool -list -keystore /path/to/servers/jks/file

 

If you don't see any PrivateKeyEntry in the output, then it would seem there is no private key and the server cannot use TLS.
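
When there are many keystores to check, the keytool output can be scanned with a few lines of Python. This sketch assumes that keytool labels each entry with the literal strings PrivateKeyEntry or trustedCertEntry in its listing. The function name count_entry_types is made up for this example.

```python
def count_entry_types(keytool_output):
    """Count private key and trusted certificate entries in `keytool -list` output."""
    counts = {"PrivateKeyEntry": 0, "trustedCertEntry": 0}
    for line in keytool_output.splitlines():
        for entry_type in counts:
            if entry_type in line:
                counts[entry_type] += 1
    return counts
```

A result with zero PrivateKeyEntry items for the keystore the role listens with points at the first cause in the list above.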

 

If there is a PrivateKeyEntry, the file may not be in JKS format.  The best way to verify that the keystore you are using is in JKS format is to use the Linux "xxd" command like this:


xxd -l 10 /path/to/servers/jks/file

 

If the above command output starts out like this, then it is a JKS file.  If not, it is a different format:


0000000: feed feed
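
The same magic-number check can be done without xxd. Here is a minimal Python sketch that reads the first four bytes and compares them with the feed feed prefix shown above. The function name looks_like_jks is an assumption for this example, and the path is whatever keystore the role is configured to use.

```python
JKS_MAGIC = bytes.fromhex("feedfeed")


def looks_like_jks(path):
    """Return True if the file starts with the JKS magic bytes FE ED FE ED."""
    with open(path, "rb") as keystore:
        return keystore.read(4) == JKS_MAGIC
```

A False result for a file named like a JKS keystore suggests it is really in another format, such as PKCS12.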

New Contributor

@bgooley The private key was missing. Thanks for your help.

Master Guru

@Haris,

 

Glad to hear you found the cause and solution.  It took a lot of sweat and tears to get to that short list of possible causes for the condition, so I'm really glad it was one of them :-).

 

Cheers!