Dear all,
I got the following error on my HDFS DataNodes. However, kinit with the keytab files generated by Cloudera Manager works fine, and the Java security policy has been applied.
Is there any way to diagnose such a problem?
For more information:
No LDAP is configured on the cluster nodes. Cloudera Manager and CDH version: 5.13.2.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:594)
    at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:396)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756)
    at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
    at org.apache.hadoop.ipc.Client.call(Client.java:1480)
    at org.apache.hadoop.ipc.Client.call(Client.java:1441)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy23.versionRequest(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.versionRequest(DatanodeProtocolClientSideTranslatorPB.java:275)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.retrieveNamespaceInfo(BPServiceActor.java:168)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:214)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673)
    at java.lang.Thread.run(Thread.java:745)
Caused by: GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)
    at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:710)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
    ... 20 more
Caused by: KrbException: Ticket expired (32) - PROCESS_TGS
    at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
    at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:192)
    at sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:203)
    at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:309)
    at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:115)
    at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:454)
    at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:641)
    ... 23 more
Caused by: KrbException: Identifier doesn't match expected value (906)
    at sun.security.krb5.internal.KDCRep.init(KDCRep.java:143)
    at sun.security.krb5.internal.TGSRep.init(TGSRep.java:66)
    at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:61)
    at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
    ... 29 more
In /var/log/krb5kdc.log, I see messages like this:
Oct 02 09:52:12 kdc.fqdn krb5kdc[2426](info): TGS_REQ (1 etypes {23}) 172.32.237.83: PROCESS_TGS: authtime 0, hdfs/datanode.fqdn@REALM for hdfs/namenode.fqdn@REALM, Ticket expired
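For anyone wanting to watch this exchange from the client side, MIT krb5 (1.9 and later) supports per-command tracing via the KRB5_TRACE environment variable. A small sketch (the service principal below is a placeholder for your own NameNode principal):

```shell
# Sketch: trace the client-side TGS exchange with MIT krb5's KRB5_TRACE.
# Requesting a service ticket with kvno forces a TGS_REQ, and the trace
# prints each KDC request/reply, including why a reply is rejected.
trace_tgs() {
  KRB5_TRACE=/dev/stderr kvno "$1"
}
# Example usage on a datanode (placeholder principal):
# trace_tgs hdfs/namenode.fqdn@REALM
```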
Thanks,
Roy
Created 10-03-2018 02:57 AM
Dear all,
After I updated the Kerberos packages, the cluster recovered.
Hope this helps. Thanks.
For more information: OS: CentOS Linux release 7.5.1804 (Core)
Package versions after the update:
[root@namenode x86_64]# rpm -qa | grep krb
krb5-server-1.15.1-19.el7.x86_64
sssd-krb5-common-1.16.0-19.el7.x86_64
sssd-krb5-1.16.0-19.el7.x86_64
krb5-libs-1.15.1-19.el7.x86_64
krb5-devel-1.15.1-19.el7.x86_64
krb5-workstation-1.15.1-19.el7.x86_64
[root@datanode01 ~]# rpm -qa | grep krb
krb5-devel-1.15.1-19.el7.x86_64
krb5-workstation-1.15.1-19.el7.x86_64
sssd-krb5-common-1.16.0-19.el7.x86_64
sssd-krb5-1.16.0-19.el7.x86_64
krb5-libs-1.15.1-19.el7.x86_64
Roy
Created 10-01-2018 08:57 PM
Hi, did you try 'kdestroy' first and then 'kinit' again to generate a new auth ticket?
Created on 10-01-2018 11:07 PM - edited 10-01-2018 11:08 PM
Hi irobot,
Yes, I did try it.
There is no problem with kinit itself; the result is as below:
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs kdestroy
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs klist
klist: No credentials cache found (filename: /tmp/krb5cc_996)
[root@datanode01 471-hdfs-DATANODE]# date
Tue Oct 2 13:54:09 CST 2018
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs kinit -kt hdfs.keytab hdfs/datanode01.domain@REALM
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs klist
Ticket cache: FILE:/tmp/krb5cc_996
Default principal: hdfs/datanode01.domain@REALM
Valid starting Expires Service principal
10/02/2018 13:54:16 10/03/2018 13:54:16 krbtgt/DOMAIN@REALM
renew until 10/09/2018 13:54:16
[root@datanode01 471-hdfs-DATANODE]#
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs klist -e
Ticket cache: FILE:/tmp/krb5cc_996
Default principal: hdfs/datanode01.domain@REALM
Valid starting Expires Service principal
10/02/2018 13:54:16 10/03/2018 13:54:16 krbtgt/DOMAIN@REALM
renew until 10/09/2018 13:54:16, Etype (skey, tkt): arcfour-hmac, aes256-cts-hmac-sha1-96
Thanks,
Roy
Created 10-01-2018 11:24 PM
Hi irobot,
Yes. The result is as below:
[root@datanode01 471-hdfs-DATANODE]# pwd
/var/run/cloudera-scm-agent/process/471-hdfs-DATANODE
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs klist -kt hdfs.keytab
Keytab name: FILE:hdfs.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
6 09/27/2018 22:49:46 HTTP/datanode01.domain@REALM
6 09/27/2018 22:49:46 HTTP/datanode01.domain@REALM
6 09/27/2018 22:49:46 HTTP/datanode01.domain@REALM
6 09/27/2018 22:49:46 HTTP/datanode01.domain@REALM
6 09/27/2018 22:49:46 HTTP/datanode01.domain@REALM
6 09/27/2018 22:49:46 HTTP/datanode01.domain@REALM
6 09/27/2018 22:49:46 HTTP/datanode01.domain@REALM
6 09/27/2018 22:49:46 hdfs/datanode01.domain@REALM
6 09/27/2018 22:49:46 hdfs/datanode01.domain@REALM
6 09/27/2018 22:49:46 hdfs/datanode01.domain@REALM
6 09/27/2018 22:49:46 hdfs/datanode01.domain@REALM
6 09/27/2018 22:49:46 hdfs/datanode01.domain@REALM
6 09/27/2018 22:49:46 hdfs/datanode01.domain@REALM
6 09/27/2018 22:49:46 hdfs/datanode01.domain@REALM
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs kdestroy
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs klist
klist: No credentials cache found (filename: /tmp/krb5cc_996)
[root@datanode01 471-hdfs-DATANODE]# date
Tue Oct 2 13:54:09 CST 2018
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs kinit -kt hdfs.keytab hdfs/datanode01.domain@REALM
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs klist
Ticket cache: FILE:/tmp/krb5cc_996
Default principal: hdfs/datanode01.domain@REALM
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs kvno hdfs/datanode01.domain@REALM
hdfs/datanode01.domain@REALM: kvno = 6
Thanks,
Roy
Created 10-01-2018 11:27 PM
[root@datanode01 471-hdfs-DATANODE]# sudo -u hdfs klist -e
Ticket cache: FILE:/tmp/krb5cc_996
Default principal: hdfs/datanode01.domain@REALM
Valid starting Expires Service principal
10/02/2018 13:54:16 10/03/2018 13:54:16 krbtgt/DOMAIN@REALM
renew until 10/09/2018 13:54:16, Etype (skey, tkt): arcfour-hmac, aes256-cts-hmac-sha1-96
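For reference, the etype numbers the KDC log prints (e.g. the `etypes {23}` in the TGS_REQ line earlier) correspond to the names `klist -e` shows. A small sketch of the standard IANA assignments for the common types:

```shell
# Map Kerberos encryption-type numbers (as shown in krb5kdc.log "etypes {N}")
# to the names klist -e prints. These are the standard IANA assignments.
etype_name() {
  case "$1" in
    16) echo "des3-cbc-sha1" ;;
    17) echo "aes128-cts-hmac-sha1-96" ;;
    18) echo "aes256-cts-hmac-sha1-96" ;;
    23) echo "arcfour-hmac" ;;
    *)  echo "unknown etype $1" ;;
  esac
}
etype_name 23   # → arcfour-hmac (the etype offered in the TGS_REQ above)
etype_name 18   # → aes256-cts-hmac-sha1-96
```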
Created 10-02-2018 09:31 AM
Are you saying that if you restart the DataNodes that this issue happens right away?
The DataNode, for RPC communication, will get a TGT (Kerberos Ticket Granting Ticket) via UserGroupInformation.loginUserFromKeytab()
This means that there is no visible cache file you can view to see the expiration time.
The exception in the stack trace means that a TGT was acquired and stored in memory, but when there was an attempt to get a Service Ticket to connect to the Active NameNode, the KDC responded that it could not process the request since the TGT had expired.
With "loginUserFromKeytab()", if the RPC connection fails, the code has built-in mechanisms that attempt to handle the condition and re-login.
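Since the login is in-memory, one way to see what it is doing is to turn on the JDK's own Kerberos/GSS debugging for the DataNode JVM. A sketch (in CM these flags would typically go into the DataNode's Java configuration options field; the exact field name may differ by version):

```shell
# Sketch: extra JVM flags that make the JDK print its Kerberos and GSS
# handshake details (TGT acquisition, TGS requests, replies) to stdout.
KRB5_DEBUG_OPTS="-Dsun.security.krb5.debug=true -Dsun.security.jgss.debug=true"
HADOOP_OPTS="${HADOOP_OPTS:-} $KRB5_DEBUG_OPTS"
echo "$HADOOP_OPTS"
```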
What happens after this message? Does the DataNode eventually recover from this?
Created 10-02-2018 06:19 PM
Hi bgooley,
Thanks for your reply
Are you saying that if you restart the DataNodes that this issue happens right away? <= Yes
Actually, there was a similar problem with HBase after I enabled Kerberos; the cluster failed to start.
Is there anything I can do to diagnose the "getting a Service Ticket" step?
No. The DataNode never recovers; eventually it cannot connect to the NameNode at all.
I enabled debug logging on the DataNode, plus Kerberos debugging, and got the following log:
2018-09-27 22:54:36,552 DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 50020: task running
2018-09-27 22:54:41,340 DEBUG org.apache.hadoop.ipc.Client: The ping interval is 60000 ms.
2018-09-27 22:54:41,340 DEBUG org.apache.hadoop.ipc.Client: Connecting to namenode.domain/172.32.237.80:8022
2018-09-27 22:54:41,341 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:hdfs/datanode01.domain@REALM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756)
2018-09-27 22:54:41,341 DEBUG org.apache.hadoop.security.SaslRpcClient: Sending sasl message state: NEGOTIATE
2018-09-27 22:54:41,342 DEBUG org.apache.hadoop.security.SaslRpcClient: Get token info proto:interface org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB info:null
2018-09-27 22:54:41,342 DEBUG org.apache.hadoop.security.SaslRpcClient: Get kerberos info proto:interface org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB info:@org.apache.hadoop.security.KerberosInfo(clientPrincipal=dfs.datanode.kerberos.principal, serverPrincipal=dfs.namenode.kerberos.principal)
2018-09-27 22:54:41,342 DEBUG org.apache.hadoop.security.SaslRpcClient: RPC Server's Kerberos principal name for protocol=org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is hdfs/namenode.domain@REALM
2018-09-27 22:54:41,342 DEBUG org.apache.hadoop.security.SaslRpcClient: Creating SASL GSSAPI(KERBEROS) client to authenticate to service at namenode.domain
2018-09-27 22:54:41,342 DEBUG org.apache.hadoop.security.SaslRpcClient: Use KERBEROS authentication for protocol DatanodeProtocolPB
2018-09-27 22:54:41,345 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs/datanode01.domain@REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]
2018-09-27 22:54:41,345 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:hdfs/datanode01.domain@REALM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:681)
2018-09-27 22:54:41,345 DEBUG org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]
2018-09-27 22:54:42,716 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:hdfs/datanode01.domain@REALM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756)
2018-09-27 22:54:42,716 DEBUG org.apache.hadoop.security.SaslRpcClient: Sending sasl message state: NEGOTIATE
Thanks,
Roy
Created 10-02-2018 08:00 PM
Hi all,
Here is more information: the NameNode is also the KDC server, and the package versions are as below:
[root@namenode conf.cloudera.hdfs]# rpm -qa | grep krb
krb5-server-1.15.1-18.el7.x86_64
sssd-krb5-common-1.16.0-19.el7.x86_64
sssd-krb5-1.16.0-19.el7.x86_64
krb5-libs-1.15.1-18.el7.x86_64
krb5-workstation-1.15.1-18.el7.x86_64
krb5-devel-1.15.1-18.el7.x86_64
[root@datanode01 ~]# rpm -qa | grep krb
krb5-devel-1.15.1-18.el7.x86_64
krb5-workstation-1.15.1-18.el7.x86_64
krb5-libs-1.15.1-18.el7.x86_64
sssd-krb5-common-1.16.0-19.el7.x86_64
sssd-krb5-1.16.0-19.el7.x86_64
Thanks,
Roy
Created 10-03-2018 12:07 PM
Nice sleuthing... indeed you hit that buggy version. Here is more info on the issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1560951
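For anyone hitting this later, a quick sketch to flag nodes still carrying the build Roy reported before the update (assumes the CentOS 7 package naming shown earlier in the thread; `1.15.1-18.el7` is the version from his pre-update listing):

```shell
# Sketch: warn if an installed krb5 package matches the build that was
# in place before the fix (per the rpm -qa output earlier in this thread).
BAD_VER="1.15.1-18.el7"
check_pkg() {
  # $1 is a "name-version-release.arch" string as printed by `rpm -qa`
  case "$1" in
    krb5-*"$BAD_VER"*) echo "WARN: $1 matches the pre-fix build, update it" ;;
    *) echo "OK: $1" ;;
  esac
}
# Example usage: rpm -qa | grep krb | while read -r p; do check_pkg "$p"; done
check_pkg "krb5-libs-1.15.1-18.el7.x86_64"   # → WARN
check_pkg "krb5-libs-1.15.1-19.el7.x86_64"   # → OK
```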