Cloudera Manager Agent is not able to communicate with the web servers of the HDFS DataNode, Impala Daemon, HBase, and YARN roles

Contributor

I have recently upgraded from CM 7.6.7 to CM 7.11.3 and CDP 7.1.7 SP2 to CDP 7.1.7 SP3.


After the upgrade, the HDFS DataNode, Impala Daemon, YARN ResourceManager, and HBase RegionServer roles show an unhealthy web server status in Cloudera Manager, as shown below.

web-server-error.png

 

After checking one of the agent logs, I found the following errors.

 

 

[18/Nov/2024 09:09:09 +0100] 2414 GM IMPALAD throttling_logger ERROR    Error fetching metrics at 'https://host.domain.com:25000/jsonmetrics?json'
Traceback (most recent call last):
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/monitor/generic/metric_collectors.py", line 224, in _collect_and_parse_and_return
    opened_url = urlopen_with_retry_on_authentication_errors(
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/util/url.py", line 339, in urlopen_with_retry_on_authentication_errors
    return function()
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/monitor/generic/metric_collectors.py", line 244, in _open_url
    return self._urlopen_callout(
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/util/url.py", line 129, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 1244, in http_error_401
    retry = self.http_error_auth_reqed('www-authenticate',
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 1124, in http_error_auth_reqed
    return self.retry_http_digest_auth(req, authreq)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 1138, in retry_http_digest_auth
    resp = self.parent.open(req, timeout=req.timeout)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/https.py", line 388, in http_error_default
    raise e
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/https.py", line 382, in http_error_default
    return old(self, req, fp, code, msg, hdrs)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
[18/Nov/2024 09:09:09 +0100] 2414 MonitorDaemon-Reporter firehoses    INFO     Creating a connection to the SERVICEMONITOR.
[18/Nov/2024 09:09:09 +0100] 2414 MonitorDaemon-Reporter firehoses    INFO     Creating a connection to the HOSTMONITOR.
[18/Nov/2024 09:09:55 +0100] 2414 MonitorDaemon-Scheduler daemon       WARNING  Monitor slow to respond in readiness check: 45s GenericMonitor HDFS-DATANODE for hdfs-DATANODE-f8021b8043faaa9d9d23bf9965e6ee07
[18/Nov/2024 09:09:55 +0100] 2414 MonitorDaemon-Scheduler daemon       INFO     Monitor expired: ('GenericMonitor HDFS-DATANODE for hdfs-DATANODE-f8021b8043faaa9d9d23bf9965e6ee07',)
[18/Nov/2024 09:09:55 +0100] 2414 GM NODEMANAGER throttling_logger ERROR    Error fetching metrics at 'https://host.domain.com:61006/jmx'
Traceback (most recent call last):
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 157, in retry_http_kerberos_auth
    neg_hdr = self.generate_request_header(req, headers, neg_value)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 111, in generate_request_header
    result = k.authGSSClientStep(self.context, neg_value)
kerberos.GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Cryptosystem internal error', -1765328206))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/monitor/generic/metric_collectors.py", line 224, in _collect_and_parse_and_return
    opened_url = urlopen_with_retry_on_authentication_errors(
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/util/url.py", line 339, in urlopen_with_retry_on_authentication_errors
    return function()
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/monitor/generic/metric_collectors.py", line 244, in _open_url
    return self._urlopen_callout(
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/util/url.py", line 129, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 228, in http_error_401
    retry = self.http_error_auth_reqed(host, req, headers)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 149, in http_error_auth_reqed
    return self.retry_http_kerberos_auth(req, headers, neg_value)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 174, in retry_http_kerberos_auth
    log.critical("GSSAPI Error: %s/%s" % (e[0][0], e[1][0]))
TypeError: 'GSSError' object is not subscriptable
[18/Nov/2024 09:09:55 +0100] 2414 GM DATANODE throttling_logger ERROR    Error fetching metrics at 'https://host.domain.com:9865/jmx'
Traceback (most recent call last):
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 157, in retry_http_kerberos_auth
    neg_hdr = self.generate_request_header(req, headers, neg_value)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 111, in generate_request_header
    result = k.authGSSClientStep(self.context, neg_value)
kerberos.GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Cryptosystem internal error', -1765328206))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/monitor/generic/metric_collectors.py", line 224, in _collect_and_parse_and_return
    opened_url = urlopen_with_retry_on_authentication_errors(
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/util/url.py", line 339, in urlopen_with_retry_on_authentication_errors
    return function()
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/monitor/generic/metric_collectors.py", line 244, in _open_url
    return self._urlopen_callout(
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/util/url.py", line 129, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 228, in http_error_401
    retry = self.http_error_auth_reqed(host, req, headers)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 149, in http_error_auth_reqed
    return self.retry_http_kerberos_auth(req, headers, neg_value)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 174, in retry_http_kerberos_auth
    log.critical("GSSAPI Error: %s/%s" % (e[0][0], e[1][0]))
TypeError: 'GSSError' object is not subscriptable
[18/Nov/2024 09:09:55 +0100] 2414 GM REGIONSERVER throttling_logger ERROR    Error fetching metrics at 'https://host.domain.com:61005/jmx'
Traceback (most recent call last):
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 157, in retry_http_kerberos_auth
    neg_hdr = self.generate_request_header(req, headers, neg_value)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 111, in generate_request_header
    result = k.authGSSClientStep(self.context, neg_value)
kerberos.GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Cryptosystem internal error', -1765328206))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/monitor/generic/metric_collectors.py", line 224, in _collect_and_parse_and_return
    opened_url = urlopen_with_retry_on_authentication_errors(
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/util/url.py", line 339, in urlopen_with_retry_on_authentication_errors
    return function()
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/monitor/generic/metric_collectors.py", line 244, in _open_url
    return self._urlopen_callout(
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/cmf/util/url.py", line 129, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/data/anaconda/miniconda_3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 228, in http_error_401
    retry = self.http_error_auth_reqed(host, req, headers)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 149, in http_error_auth_reqed
    return self.retry_http_kerberos_auth(req, headers, neg_value)
  File "/data/cloudera/cm-agent/lib/python3.8/site-packages/urllib_kerberos/__init__.py", line 174, in retry_http_kerberos_auth
    log.critical("GSSAPI Error: %s/%s" % (e[0][0], e[1][0]))
TypeError: 'GSSError' object is not subscriptable
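
The `TypeError: 'GSSError' object is not subscriptable` at the end of each traceback above appears to be a secondary failure: the logging line indexes the exception object directly, which Python 3 does not allow, so the error worth investigating is the original `GSSError` ("Cryptosystem internal error"). The following is a minimal sketch of that behavior, not the agent's actual code. `FakeGSSError` and `format_gss_error` are hypothetical names, and the assumption is that the real `kerberos.GSSError` carries two `(message, code)` tuples in `args`, as the log output suggests.

```python
class FakeGSSError(Exception):
    """Hypothetical stand-in for kerberos.GSSError (assumed to be a plain Exception subclass)."""


def format_gss_error(exc):
    """Build the 'major/minor' message from exc.args instead of indexing exc itself."""
    major, minor = exc.args  # assumption: two (message, code) tuples
    return "GSSAPI Error: %s/%s" % (major[0], minor[0])


# Same values as the GSSError shown in the agent log.
err = FakeGSSError(
    ("Unspecified GSS failure.  Minor code may provide more information", 851968),
    ("Cryptosystem internal error", -1765328206),
)

try:
    # Python 2 style indexing, like line 174 of urllib_kerberos/__init__.py in the traceback.
    err[0][0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

message = format_gss_error(err)
print(indexing_failed)
print(message)
```

Reading the fields through `exc.args` works on Python 3, so the underlying GSSAPI message would be logged instead of being hidden behind a `TypeError`.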

 

 

I have tried everything I could think of, without success.

First, the hosts are heartbeating.
Second, /etc/krb5.conf appears to be the same as on a working host (the Hue server host in this case). The Web Server Status issue is the same across HDFS, HBase, YARN, and Impala.
Third, I tried a manual kinit earlier, but the same error is still thrown.
After running a manual kinit (kinit -k -t hdfs.keytab hdfs/host.my-default-realm.com) from the latest DataNode process directory, I ran klist -e and got the following.

[root@host 1546506889-hdfs-DATANODE]# klist -e
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: HTTP/host@EXAMPLE-REALM.com

Valid starting     Expires            Service principal
18/11/24 16:59:35  19/11/24 02:59:34  krbtgt/host@EXAMPLE-REALM.COM
        renew until 25/11/24 16:59:34, Etype (skey, tkt): arcfour-hmac, aes256-cts-hmac-sha1-96



Below are the Kerberos encryption types configured in the Cloudera Manager console.

sayebogbon_0-1731932582515.png

 

Below is part of the host's /etc/krb5.conf.

[libdefaults]
 renew_lifetime = 604800
 ticket_lifetime = 36000
 udp_preference_limit = 1
 permitted_enctypes = rc4-hmac aes256-cts aes128-cts
 default_tgs_enctypes = rc4-hmac aes256-cts aes128-cts
 default_tkt_enctypes = rc4-hmac aes256-cts aes128-cts
 default_realm = my-default-realm.com
 default_etypes = arcfour-hmac-md5
 default_etypes_des = des-cbc-crc
 allow_weak_crypto = true

 forwardable = true
 default_keytab_name = /etc/opt/quest/vas/host.keytab
[libvas]
 site-name-override = iNET-LDAP
 use-dns-srv = true
 use-tcp-only = true

 auth-helper-timeout = 60
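
One way to cross-check the klist output against this configuration is to normalize the encryption type names and confirm that the ticket's etypes are listed in `permitted_enctypes`. The sketch below assumes MIT-style enctype aliases (`rc4-hmac` for `arcfour-hmac`, `aes256-cts` for `aes256-cts-hmac-sha1-96`, and so on). The helper names `canonical` and `parse_libdefaults` are hypothetical, and only a subset of the settings above is reproduced.

```python
# Alias table based on MIT krb5 enctype names (assumption: the host uses MIT-style names).
ALIASES = {
    "rc4-hmac": "arcfour-hmac",
    "arcfour-hmac-md5": "arcfour-hmac",
    "aes256-cts": "aes256-cts-hmac-sha1-96",
    "aes128-cts": "aes128-cts-hmac-sha1-96",
}


def canonical(name):
    """Map an enctype alias to one canonical spelling."""
    name = name.strip().lower()
    return ALIASES.get(name, name)


def parse_libdefaults(text):
    """Return key -> value for the [libdefaults] section of krb5.conf-style text."""
    settings, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == "libdefaults" and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


# Subset of the configuration shown above.
KRB5_CONF = """
[libdefaults]
 permitted_enctypes = rc4-hmac aes256-cts aes128-cts
 default_tgs_enctypes = rc4-hmac aes256-cts aes128-cts
 default_tkt_enctypes = rc4-hmac aes256-cts aes128-cts
 allow_weak_crypto = true
[libvas]
 use-tcp-only = true
"""

# Etypes reported by `klist -e` above (session key, ticket).
ticket_etypes = ["arcfour-hmac", "aes256-cts-hmac-sha1-96"]

conf = parse_libdefaults(KRB5_CONF)
permitted = {canonical(e) for e in conf["permitted_enctypes"].split()}
not_permitted = [e for e in ticket_etypes if canonical(e) not in permitted]
print(sorted(permitted))
print(not_permitted)
```

An empty `not_permitted` list only shows that the names in the file cover the ticket's etypes. It does not test whether the crypto library loaded by the agent's Python interpreter can actually process them.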


Finally, the OS upgrade has not been performed yet. We're still on Red Hat OL7.

I know you're busy but any support will be much appreciated.

Thanks,

Stephen

2 REPLIES

Expert Contributor

Hello @sayebogbon ,

Based on the error in the log you shared:

opened_url = urlopen_with_retry_on_authentication_errors

And the klist output showing this:

Valid starting     Expires
10/11/24 23:43:47  11/11/24 09:43:47 

It looks like you need to regenerate the Kerberos credentials for this host.
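
The expiry reasoning above can be sketched as a small check. This is only an illustration: it assumes klist prints dates as day/month/year (as the 18/11/24 entries suggest), ignores time zones, and uses the time of the later curl test (18 Nov 2024 17:07:11) as the reference point. `ticket_is_expired` is a hypothetical helper name.

```python
from datetime import datetime

# Assumption: klist prints dates as day/month/year with a two-digit year.
KLIST_FORMAT = "%d/%m/%y %H:%M:%S"


def ticket_is_expired(expires, now):
    """Return True if the klist 'Expires' timestamp is at or before `now`."""
    return datetime.strptime(expires, KLIST_FORMAT) <= now


# Reference time taken from the Date header of the curl test; time zones are ignored.
check_time = datetime(2024, 11, 18, 17, 7, 11)

old_ticket_expired = ticket_is_expired("11/11/24 09:43:47", check_time)
new_ticket_expired = ticket_is_expired("19/11/24 02:59:34", check_time)
print(old_ticket_expired, new_ticket_expired)
```

Under these assumptions, the ticket that expired on 11/11/24 is stale, while the one expiring on 19/11/24 is still valid at the reference time.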

To do so, please stop all services on this host.

Then go to CM > Administration > Security > Kerberos credentials.

In the search bar, type the hostname, select all the principals that appear, and then click the Regenerate Selected button.

If there are no problems, new credentials should be generated.

Restart your services and let us know if that helps.

Contributor

Apologies, that was the wrong ticket; I should have changed it. I have updated it now.
I had previously regenerated both the keytabs and the Kerberos credentials many times, without success.

Also, after manually obtaining a Kerberos ticket with kinit -k -t /var/run/cloudera-scm-agent/process/1546506889-hdfs-DATANODE/hdfs.keytab HTTP/host@EXAMPLE-REALM.COM, I was able to run curl against the DataNode web URL (https://fqdn:9865) and got a 200 OK response. However, it seems that Cloudera Manager isn't able to detect the credentials for some reason.
See the response below.

 

 

[root@host 1546506889-hdfs-DATANODE]# curl -v -k --negotiate -u : https://host.com:9865
* About to connect() to host.com port 9865 (#0)
*   Trying xx.xx.xxx.xx...
* Connected to host.com (xx.xx.xxx.xx) port 9865 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*       subject: CN=host.com,OU=Technology,O=xxxx plc,L=xl,ST=xl,C=GB
*       start date: Nov 14 14:41:16 2024 GMT
*       expire date: Nov 09 14:41:16 2025 GMT
*       common name: host.com
*       issuer: CN=host.com,OU=Technology,O=xxxx plc,L=xl,ST=xl,C=GB
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: host.com:9865
> Accept: */*
>
< HTTP/1.1 401 Authentication required
< Connection: close
< Pragma: no-cache
< Strict_Transport_Security: max-age=0; includeSubDomains
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Pragma: no-cache
< Strict_Transport_Security: max-age=0; includeSubDomains
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< WWW-Authenticate: Negotiate
< Set-Cookie: hadoop.auth=; Path=/; HttpOnly
< Cache-Control: must-revalidate,no-cache,no-store
< Content-Type: text/html;charset=iso-8859-1
< Content-Length: 447
<
* Closing connection 0
* Issue another request to this URL: 'https://host.com:9865/'
* About to connect() to host.com port 9865 (#1)
*   Trying xx.xx.xxx.xx...
* Connected to host.com (xx.xx.xxx.xx) port 9865 (#1)
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*       subject: CN=host.com,OU=Technology,O=xxxx plc,L=xl,ST=xl,C=GB
*       start date: Nov 14 14:41:16 2024 GMT
*       expire date: Nov 09 14:41:16 2025 GMT
*       common name: host.com
*       issuer: CN=host.com,OU=Technology,O=xxxx plc,L=xl,ST=xl,C=GB
* Server auth using GSS-Negotiate with user ''
> GET / HTTP/1.1
> Authorization: Negotiate xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxg==
> User-Agent: curl/7.29.0
> Host: host.com:9865
> Accept: */*
>
< HTTP/1.1 200 OK
< Connection: close
< Date: Mon, 18 Nov 2024 17:07:11 GMT
< Cache-Control: no-cache
< Expires: Mon, 18 Nov 2024 17:07:11 GMT
< Date: Mon, 18 Nov 2024 17:07:11 GMT
< Pragma: no-cache
< Content-Type: text/html
< Strict_Transport_Security: max-age=0; includeSubDomains
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Expires: Mon, 18 Nov 2024 17:07:11 GMT
< Date: Mon, 18 Nov 2024 17:07:11 GMT
< Pragma: no-cache
< Strict_Transport_Security: max-age=0; includeSubDomains
< X-Content-Type-Options: nosniff
< X-FRAME-OPTIONS: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< WWW-Authenticate: Negotiate xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=
< Set-Cookie: hadoop.auth="u=HTTP&p=HTTP/host.com.COM&t=kerberos&e=17xxxxxx7&s=CaYM+xxxxxxxxxxfBXleJ0K/ObFbrjALqy/R//g="; Path=/; HttpOnly
< Last-Modified: Fri, 30 Aug 2024 16:14:30 GMT
< Accept-Ranges: bytes
< Content-Length: 1085
<
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="REFRESH" content="0;url=datanode.html" />
  <title>Hadoop Administration</title>
</head>
* Closing connection 1
</html>[root@host 1546506889-hdfs-DATANODE]#
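
The 200 response above sets a `hadoop.auth` cookie, which records the identity the DataNode accepted for the curl request. Below is a minimal sketch that splits the cookie value into its fields. The reading of `u` as user, `p` as principal, `t` as authentication type, `e` as expiry and `s` as signature is an assumption based on the cookie's layout, and `parse_hadoop_auth` is a hypothetical helper name.

```python
def parse_hadoop_auth(cookie_value):
    """Split a hadoop.auth cookie value into its key=value fields."""
    fields = {}
    for part in cookie_value.strip('"').split("&"):
        # Split on the first '=' only, so base64 padding in the signature is kept.
        key, _, value = part.partition("=")
        fields[key] = value
    return fields


# Redacted cookie value copied from the Set-Cookie header in the curl output.
cookie = "u=HTTP&p=HTTP/host.com.COM&t=kerberos&e=17xxxxxx7&s=CaYM+xxxxxxxxxxfBXleJ0K/ObFbrjALqy/R//g="
token = parse_hadoop_auth(cookie)
print(token["u"], token["p"], token["t"])
```

Here the request was authenticated via Kerberos as the HTTP principal from the keytab, which is consistent with the manual kinit working for curl even though the agent's own requests fail.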