Created on 11-21-2018 07:56 AM - edited 09-16-2022 06:55 AM
Problem: Cloudera Agent HTTP error 401.
Local Kerberos: Active
Version: CDH 6.0.0
HDFS and YARN both run perfectly with HA on. Disabling and enabling a ResourceManager works as expected: for example, one RM goes active after the other goes down.
Still, whenever one RM goes into Standby mode, the Cloudera-SCM-Agent on that host starts showing the following error.
[21/Nov/2018 15:00:31 +0000] 30488 Monitor-GenericMonitor url ERROR Autentication error on attempt 1. Retrying after sleeping 1.000000 seconds.
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 241, in urlopen_with_retry_on_authentication_errors
    return function()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/monitor/generic/metric_collectors.py", line 220, in _open_url
    password=self._password_value)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 82, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 469, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 203, in http_error_401
    retry = self.http_error_auth_reqed(host, req, headers)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 127, in http_error_auth_reqed
    return self.retry_http_kerberos_auth(req, headers, neg_value)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 143, in retry_http_kerberos_auth
    resp = self.parent.open(req)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 469, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 656, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 219, in http_error_default
    raise e
HTTPError: HTTP Error 401: Authentication required
If we disable the property "Enable Kerberos Authentication for HTTP Web-Consoles", this problem goes away.
At the moment this is the only problematic situation we have with Kerberos. We've tried to Regenerate Keytabs, but the problem remains.
This only happens on the Agent whose host runs the Standby Resource Manager. On the Active there is no problem. If we shut down the Active, the Standby becomes Active (as it should), and then the error starts appearing on the Agent of the "new" Standby.
Also, if we remove HA from YARN (leaving only one "active" RM), the problem goes away...
Any ideas?
[Update]: we found out the following (fake names for example).
klist /var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Ticket cache: FILE:/var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Default principal: HTTP/sl000060.besp.dsp.gbes@CLBGDXD01.BESP.DSP.GBES
Valid starting       Expires              Service principal
11/21/2018 16:44:55  11/22/2018 16:44:55  krbtgt/LOCAL.REALM@LOCAL.REALM
        renew until 11/26/2018 16:44:55
11/21/2018 16:45:25  11/22/2018 16:44:55  HTTP/sl000060.domain.stuff@
        renew until 11/26/2018 16:44:55
11/21/2018 16:45:25  11/22/2018 16:44:55  HTTP/sl000060.domain.stuff@LOCAL.REALM
        renew until 11/26/2018 16:44:55
Where is this HTTP/sl000060.domain.stuff@ entry (with an empty realm) coming from, when the only HTTP principals we have in kadmin.local use the format HTTP/node.domain.stuff@LOCAL.REALM?
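To spot such entries on other hosts, a small script can scan klist output for service principals whose realm part is empty. This is only a sketch: the function name `principals_without_realm` is made up for illustration, and the regular expression assumes the MM/DD/YYYY timestamp layout shown in the output above, which may differ with other locales or Kerberos builds.

```python
import re

# A ticket row is assumed to be: two "MM/DD/YYYY HH:MM:SS" timestamps,
# followed by the service principal (layout taken from the klist output above).
TICKET_ROW = re.compile(
    r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\s+"
    r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\s+"
    r"(?P<principal>\S+)\s*$"
)


def principals_without_realm(klist_output):
    """Return service principals whose realm (the part after '@') is empty."""
    flagged = []
    for line in klist_output.splitlines():
        match = TICKET_ROW.match(line.strip())
        # "renew until ..." lines and headers do not match the row pattern.
        if match and match.group("principal").endswith("@"):
            flagged.append(match.group("principal"))
    return flagged


# Sample text mirroring the ticket cache listing from this post.
sample = """\
Valid starting       Expires              Service principal
11/21/2018 16:44:55  11/22/2018 16:44:55  krbtgt/LOCAL.REALM@LOCAL.REALM
        renew until 11/26/2018 16:44:55
11/21/2018 16:45:25  11/22/2018 16:44:55  HTTP/sl000060.domain.stuff@
        renew until 11/26/2018 16:44:55
11/21/2018 16:45:25  11/22/2018 16:44:55  HTTP/sl000060.domain.stuff@LOCAL.REALM
        renew until 11/26/2018 16:44:55
"""

print(principals_without_realm(sample))
```

On a real host the input would be the text printed by `klist /var/run/cloudera-scm-agent/krb5cc_cm_agent_0` rather than the hard-coded sample.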
Created on 11-22-2018 01:12 AM - edited 11-22-2018 02:30 AM
Hi. I've managed to solve (at least) the ticket cache issue. Our krb5.conf was missing the entries under the [domain_realm] section. After adding them, we restarted all agents. (I'm still not 100% sure this was the cause...)
klist /var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Ticket cache: FILE:/var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Default principal: HTTP/sl000060.domain.stuff@LOCAL.REALM
Valid starting       Expires              Service principal
11/22/2018 09:00:10  11/23/2018 09:00:10  krbtgt/LOCAL.REALM@LOCAL.REALM
        renew until 11/27/2018 09:00:10
11/22/2018 09:01:04  11/23/2018 09:00:10  HTTP/sl000060.domain.stuff@LOCAL.REALM
        renew until 11/27/2018 09:00:10
Still, the reported error didn't go away...
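For anyone checking their own krb5.conf for the same gap, here is a rough sketch of what a [domain_realm] mapping can look like and how to test whether a host is covered by one. The mapping lines in the sample are an assumption built from the placeholder names in this thread, not our actual file, and the helper names are hypothetical. The coverage check is a simplification and does not reproduce the exact lookup order used by the Kerberos library.

```python
def domain_realm_mappings(krb5_conf_text):
    """Collect 'domain = REALM' pairs from the [domain_realm] section."""
    mappings = {}
    in_section = False
    for raw in krb5_conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith("["):
            in_section = (line == "[domain_realm]")
            continue
        if in_section and "=" in line:
            domain, realm = (part.strip() for part in line.split("=", 1))
            mappings[domain] = realm
    return mappings


def host_is_covered(hostname, mappings):
    """Simplified check: exact host entry, or a '.domain' suffix entry."""
    hostname = hostname.lower()
    if hostname in mappings:
        return True
    return any(
        domain.startswith(".") and hostname.endswith(domain)
        for domain in mappings
    )


# Hypothetical krb5.conf excerpt using the placeholder names from this thread.
sample_conf = """\
[libdefaults]
  default_realm = LOCAL.REALM

[domain_realm]
  .domain.stuff = LOCAL.REALM
  domain.stuff = LOCAL.REALM
"""

mappings = domain_realm_mappings(sample_conf)
print(mappings)
print(host_is_covered("sl000060.domain.stuff", mappings))
```

If the function returns an empty dictionary for your file, the section is missing or empty, which matches the situation we had before the change.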
Summary Notes: the principals are just examples of how things are configured.
As a reminder, this is what we observe:
1. Only the Cloudera Agent with the Yarn RM on Standby (High Availability on) presents the error.
2. If we disable the property in YARN, "Enable Kerberos Authentication for HTTP Web-Consoles" the error goes away.
3. All other services are running perfectly according to Cloudera Manager (no alarms/warnings), except for the reported "Bad : The Cloudera Manager Agent is not able to communicate with this role's web server." on the Standby YARN RM.
4. ZooKeeper seems to be performing the correct actions, activating the Standby RM after the previous Active one is disabled (Standby to Active, and vice versa).
5. If we disable the High Availability on YARN, therefore having only one Resource Manager, the problem goes away.
Created 12-18-2018 03:35 AM
The same is happening in a cluster that we implemented for a client. Exactly the same behaviour is observed.
I have not yet found a workaround, nor have I found an answer that helps fix the problem.
My guess is that this will be fixed in a future release.
Thanks!
Created on 12-18-2018 04:01 AM - edited 12-18-2018 06:23 AM
We basically disabled "Kerberos Authentication for HTTP Web-Consoles" in YARN... so... yeah... let's hope someone finds this and figures out what is happening.
Created 12-18-2018 04:04 AM
Hi João.
Yeah, that is a workaround to prevent the yarn standby resource manager from going red.
It is still a mystery to me how the active resource manager works normally and the standby doesn't, since they call the same Python libraries. Even if you do a manual failover, the node that was active and became the "new" standby will also run into the same issue.
Anyway, let's wait for the answer from our friends at Cloudera.
Created 12-18-2018 05:29 AM
Hi,
We have the same problem, but we were wondering about "Kerberos Authentication for HTTP Web-Consoles": does the standby node have a web console? We tried on another Cloudera client and there is a redirection to the active one. Maybe the error is in how authentication is handled on the standby node...
Created 12-18-2018 05:43 AM
Created 12-18-2018 09:01 AM
Everyone,
This issue is a bug in CDH 6 YARN where the Standby Resource Manager redirects "metrics" and "jmx" requests to the Active Resource Manager. In CDH 5, jmx and metrics were not redirected.
If Kerberos Authentication for HTTP Web-Consoles is enabled for YARN, the Cloudera Manager agent on the Standby RM host has to authenticate when it collects jmx information from that RM. The Python Kerberos handler treats any "redirect" response from the server as an error, so the authentication attempt fails. That's why you see, in the agent logs:
HTTPError: HTTP Error 401: Authentication required
The Standby still redirects when Kerberos Authentication for HTTP Web-Consoles is disabled, but no error occurs because the agent can follow the redirect to the Active RM when authentication is not required.
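To see for yourself how a Resource Manager answers a jmx request, you can issue a single GET without following redirects. This is a minimal sketch, not what the agent itself runs: the function names are hypothetical, the host name in the usage comment is a placeholder, and 8088 is only the default RM web port, so adjust it to your configuration. Note that with Kerberos authentication enabled, an unauthenticated request like this one should be answered with a 401 before any redirect is issued, so the probe is most informative with the setting disabled.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def probe_without_following(url, timeout=5.0):
    """Send one GET and return (status, Location header); never follows redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe(status, location):
    """Turn a (status, Location) pair into a short human-readable verdict."""
    if status in REDIRECT_CODES:
        return "redirected to %s" % location
    if status == 401:
        return "authentication required (no redirect seen)"
    return "answered directly with HTTP %d" % status


# Usage (placeholder host name, default RM web port assumed):
#   status, location = probe_without_following("http://standby-rm.example:8088/jmx")
#   print(describe(status, location))
```

Running it against both Resource Managers should show the difference described above: the Active answers directly, while the Standby answers with a redirect pointing at the Active.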
There are not many options to work around this:
- Ignore the standby bad health
- Suppress alerts for "Web Server Status" for Resource Manager
- Disable Kerberos Authentication for HTTP Web-Consoles for YARN
Internal Cloudera Jira for reference:
CDH-76040
Cloudera is evaluating the issue and determining the best course of action.
Currently, the issue is present in CDH 6.1 as well as CDH 6.0.x.