Explorer
Posts: 22
Registered: ‎12-19-2017

HTTP error CM-Agent with YARN HA on RM Standby node

[ Edited ]

Problem: Cloudera Agent HTTP error 401.
Local Kerberos: Active

Version: CDH 6.0.0

 

HDFS and YARN both have HA enabled and are running perfectly. Disabling and re-enabling a ResourceManager works as expected: for example, one RM becomes Active after the other goes down.

 

Still, whenever one RM goes into Standby mode, the Cloudera-SCM-Agent on that host starts logging the following error.

 

[21/Nov/2018 15:00:31 +0000] 30488 Monitor-GenericMonitor url          ERROR    Autentication error on attempt 1. Retrying after sleeping 1.000000 seconds.
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 241, in urlopen_with_retry_on_authentication_errors
    return function()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/monitor/generic/metric_collectors.py", line 220, in _open_url
    password=self._password_value)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 82, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 469, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 203, in http_error_401
    retry = self.http_error_auth_reqed(host, req, headers)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 127, in http_error_auth_reqed
    return self.retry_http_kerberos_auth(req, headers, neg_value)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 143, in retry_http_kerberos_auth
    resp = self.parent.open(req)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 469, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 656, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 219, in http_error_default
    raise e
HTTPError: HTTP Error 401: Authentication required

 

If we disable the property "Enable Kerberos Authentication for HTTP Web-Consoles", this problem goes away.

 

At the moment this is the only problem we have with Kerberos. We've tried regenerating the keytabs, but the problem remains.

This only happens on the Agent whose host runs the Standby ResourceManager. On the Active there is no problem. If we shut down the Active, the Standby becomes Active (as it should), and the error then starts appearing on the Agent of the "new" Standby.

 

Also, if we remove HA from YARN (leaving only one, Active, RM), the problem goes away...

 

Any ideas? 

 

 

 

[Update]: we found out the following (fake names for example). 

klist /var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Ticket cache: FILE:/var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Default principal: HTTP/sl000060.besp.dsp.gbes@CLBGDXD01.BESP.DSP.GBES

Valid starting       Expires              Service principal
11/21/2018 16:44:55  11/22/2018 16:44:55  krbtgt/LOCAL.REALM@LOCAL.REALM
        renew until 11/26/2018 16:44:55
11/21/2018 16:45:25  11/22/2018 16:44:55  HTTP/sl000060.domain.stuff@
        renew until 11/26/2018 16:44:55
11/21/2018 16:45:25  11/22/2018 16:44:55  HTTP/sl000060.domain.stuff@LOCAL.REALM
        renew until 11/26/2018 16:44:55

Where does the HTTP/sl000060.domain.stuff@ entry (with an empty realm after the @) come from, when the only HTTP principals we have in kadmin.local follow the format HTTP/node.domain.stuff@LOCAL.REALM?
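For anyone checking several agents for the same symptom, here is a minimal Python 3 sketch that scans `klist` output for service principals with an empty realm. The function name and the regular expression are illustrative assumptions, and the sample text is simply the ticket list from the output above. An entry with an empty realm appears to be consistent with the client not being able to map the host to a realm locally, but treat that as a hypothesis rather than a confirmed diagnosis.

```python
import re

# Matches one ticket line of klist output in the format shown above:
#   "<start date> <start time>  <end date> <end time>  <service principal>"
TICKET_LINE = re.compile(
    r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\s+"
    r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\s+"
    r"(?P<principal>\S+)\s*$"
)


def principals_without_realm(klist_output):
    """Return service principals whose realm part (after '@') is empty."""
    flagged = []
    for line in klist_output.splitlines():
        match = TICKET_LINE.match(line)
        if not match:
            continue  # header, blank line, or "renew until" line
        principal = match.group("principal")
        _name, sep, realm = principal.rpartition("@")
        if sep and realm == "":
            flagged.append(principal)
    return flagged


# Ticket list copied from the klist output above (example names).
SAMPLE = """\
Valid starting       Expires              Service principal
11/21/2018 16:44:55  11/22/2018 16:44:55  krbtgt/LOCAL.REALM@LOCAL.REALM
        renew until 11/26/2018 16:44:55
11/21/2018 16:45:25  11/22/2018 16:44:55  HTTP/sl000060.domain.stuff@
        renew until 11/26/2018 16:44:55
11/21/2018 16:45:25  11/22/2018 16:44:55  HTTP/sl000060.domain.stuff@LOCAL.REALM
        renew until 11/26/2018 16:44:55
"""

print(principals_without_realm(SAMPLE))
```

On the sample above this flags only the `HTTP/sl000060.domain.stuff@` entry.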

Explorer
Posts: 22
Registered: ‎12-19-2017

Re: HTTP error CM-Agent with YARN HA on RM Standby node

[ Edited ]

Hi. I've managed to fix (at least) the ticket cache. Our krb5.conf was missing the entries under the [domain_realm] section. After adding them, we restarted all agents. (I'm still not 100% sure that this was the cause...)

 

klist /var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Ticket cache: FILE:/var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Default principal: HTTP/sl000060.domain.stuff@LOCAL.REALM

Valid starting       Expires              Service principal
11/22/2018 09:00:10  11/23/2018 09:00:10  krbtgt/LOCAL.REALM@LOCAL.REALM
        renew until 11/27/2018 09:00:10
11/22/2018 09:01:04  11/23/2018 09:00:10  HTTP/sl000060.domain.stuff@LOCAL.REALM
        renew until 11/27/2018 09:00:10
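To illustrate what the [domain_realm] section does, here is a rough Python 3 sketch of a host-to-realm lookup. It is a simplified approximation, not the actual Kerberos library logic: it tries an exact host entry first and then parent domains written with a leading dot. The krb5.conf contents, the mapping entries, and both function names are assumptions made for the example, built from the placeholder names used in this thread; the entries we actually added are not shown here.

```python
def parse_domain_realm(krb5_conf_text):
    """Collect 'name = REALM' pairs from the [domain_realm] section."""
    mapping = {}
    in_section = False
    for raw in krb5_conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith("["):
            in_section = (line == "[domain_realm]")
            continue
        if in_section and "=" in line:
            name, realm = (part.strip() for part in line.split("=", 1))
            mapping[name.lower()] = realm
    return mapping


def realm_for_host(hostname, mapping):
    """Exact host entry first, then parent domains with a leading dot."""
    host = hostname.lower()
    if host in mapping:
        return mapping[host]
    labels = host.split(".")
    for i in range(1, len(labels)):
        domain = "." + ".".join(labels[i:])
        if domain in mapping:
            return mapping[domain]
    return None  # no local mapping found for this host


# Hypothetical krb5.conf fragment using the example names from this thread.
EXAMPLE_CONF = """\
[libdefaults]
    default_realm = LOCAL.REALM

[domain_realm]
    .domain.stuff = LOCAL.REALM
    domain.stuff = LOCAL.REALM
"""

mapping = parse_domain_realm(EXAMPLE_CONF)
print(realm_for_host("sl000060.domain.stuff", mapping))
print(realm_for_host("sl000060.other.example", mapping))
```

With the hypothetical fragment above, the first host resolves to LOCAL.REALM and the second has no mapping, which is the situation we seem to have been in before the section was added.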

Still, the reported error didn't go away...

 

 

Summary notes: the principals shown are only examples of how things are configured.

As a reminder, this is what we observe:

1. Only the Cloudera Agent with the Yarn RM on Standby (High Availability on) presents the error.
2. If we disable the property in YARN, "Enable Kerberos Authentication for HTTP Web-Consoles" the error goes away.

3. All other services are running perfectly according to Cloudera Manager (no alarms/warnings), except for the reported "Bad : The Cloudera Manager Agent is not able to communicate with this role's web server." on the Standby YARN RM.
4. Zookeeper seems to be performing the correct actions, Activating the Standby RM after disabling the previous Active one. (Standby to Active, and vice versa).

5. If we disable the High Availability on YARN, therefore having only one Resource Manager, the problem goes away.

New Contributor
Posts: 3
Registered: ‎09-12-2018

Re: HTTP error CM-Agent with YARN HA on RM Standby node

The same is happening in a cluster that we deployed for a client. Exactly the same behaviour is observed.
I have not yet found a workaround, nor an answer that helps fix the problem.
My guess is that this will be fixed in a future release.

Thanks!

Explorer
Posts: 22
Registered: ‎12-19-2017

Re: HTTP error CM-Agent with YARN HA on RM Standby node

[ Edited ]

We basically disabled "Kerberos Authentication for HTTP Web-Consoles" in YARN... so, yeah, let's hope someone finds this and figures out what is happening.

New Contributor
Posts: 3
Registered: ‎09-12-2018

Re: HTTP error CM-Agent with YARN HA on RM Standby node

Hi João.

Yeah, that is a workaround to prevent the yarn standby resource manager from going red.

It is still a mystery to me why the active resource manager works normally and the standby doesn't, since they use the same Python libraries. Even if you do a manual failover, the node that was active and became the "new" standby will run into the same issue.

Anyway, let's wait for an answer from our friends at Cloudera.

New Contributor
Posts: 1
Registered: ‎12-18-2018

Re: HTTP error CM-Agent with YARN HA on RM Standby node

Hi, 

 

We have the same problem. We were wondering about "Kerberos Authentication for HTTP Web-Consoles": does the standby node have a web console at all? We tried on another Cloudera client and there is a redirection to the active one, so maybe the error is in how authentication is handled on the standby node...

New Contributor
Posts: 3
Registered: ‎09-12-2018

Re: HTTP error CM-Agent with YARN HA on RM Standby node

Pacioz, I tested this in a cluster that we have in our lab, and each Resource Manager's web console points to the host where that Resource Manager is installed. So there are two different web consoles. Note, however, that "Kerberos Authentication for HTTP Web-Consoles" was disabled in that test.
Posts: 1,035
Topics: 1
Kudos: 258
Solutions: 128
Registered: ‎04-22-2014

Re: HTTP error CM-Agent with YARN HA on RM Standby node

Everyone,

 

This issue is a bug in CDH 6 YARN where the Standby Resource Manager redirects "metrics" and "jmx" requests to the Active Resource Manager. In CDH 5, jmx and metrics requests were not redirected.

 

If Kerberos Authentication for HTTP Web-Consoles is enabled for YARN, the Cloudera Manager agent on the Standby RM host will attempt to collect jmx information from that RM. The Python Kerberos handler used by the agent treats any "redirect" response from the server as an error, so it fails the authentication attempt. That's why you see the following in the agent logs:

HTTPError: HTTP Error 401: Authentication required

 

The Standby redirect still occurs if Kerberos Authentication for HTTP Web-Consoles is disabled, but no error results, since the agent can follow the redirect to the Active RM when authentication is not required.
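To see the redirect described above for yourself, a small Python 3 sketch like the following can request a URL once without following redirects and report what came back. This is not the agent's code; the class and function names are illustrative assumptions. Also note that with Kerberos authentication enabled, an unauthenticated request will most likely get a 401 before any redirect becomes visible, so the probe is mainly informative while authentication is disabled.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Report 3xx responses instead of silently following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response.
        return None


def classify(status, location):
    """Turn a status code and Location header into a short diagnosis."""
    if 300 <= status < 400 and location:
        return "redirect to " + location
    if status == 401:
        return "authentication required"
    if 200 <= status < 300:
        return "served locally"
    return "unexpected status %d" % status


def probe(url, timeout=5.0):
    """Fetch url once, without following redirects, and classify the result."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return classify(resp.status, resp.headers.get("Location"))
    except urllib.error.HTTPError as err:
        return classify(err.code, err.headers.get("Location"))
```

For example, calling `probe()` with the `/jmx` URL of the Standby RM web server (host and port depend on your cluster) should report a redirect to the Active RM if the behaviour described above applies, while the same call against the Active RM should report that the request was served locally.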

 

There are not many options to work around this:

 

- Ignore the bad health status of the Standby RM

- Suppress alerts for "Web Server Status" for the Resource Manager

- Disable Kerberos Authentication for HTTP Web-Consoles for YARN

 

Internal Cloudera Jira for reference:

 

CDH-76040

 

Cloudera is evaluating the issue and determining the best course of action.

Currently, the issue is present in CDH 6.1 as well as CDH 6.0.x.

 

 
