Created 07-21-2017 06:18 AM
Team,
We have enabled Kerberos on our staging cluster. Its noticed that everyday morning when we try to execute hadoop command like hadoop fs -ls / it fails with error. To fix the following we need to restart the namenodes.
user1@datanode1001:~$ hadoop fs -ls /
<Truncated> Found ticket for user1@EXAMPLE.COM to go to krbtgt/EXAMPLE.COM@EXAMPLE.COM expiring on Fri Jul 21 14:43:36 UTC 2017 Entered Krb5Context.initSecContext with state=STATE_NEW Found ticket for user1@EXAMPLE.COM to go to krbtgt/EXAMPLE.COM@EXAMPLE.COMexpiring on Fri Jul 21 14:43:36 UTC 2017 Found ticket for user1@EXAMPLE.COM to go to nn/namenode1001.example.com@EXAMPLE.COM expiring on Fri Jul 21 14:43:36 UTC 2017 Found ticket for user1@EXAMPLE.COM to go to nn/namenode1002.example.com@EXAMPLE.COM expiring on Fri Jul 21 14:43:36 UTC 2017 Found service ticket in the subjectTicket (hex) = 0000: 61 82 01 6A 30 82 01 66 A0 03 02 01 05 A1 0C 1Ba..j0..f........ 0010: 0A 49 4E 4D 4F 42 49 2E 43 4F 4D A2 31 30 2F A0.EXAMPLE.COM.10/. <trucated>Client Principal = user1@EXAMPLE.COM Server Principal = nn/namenode1001.example.com@EXAMPLE.COM Session Key = EncryptionKey: keyType=18 keyBytes (hex dump)= 0000: 23 68 3A EA 16 21 D4 B9 31 0E C6 F8 C6 39 8D 4E#h:..!..1....9.N 0010: 99 68 B8 6A C5 90 E9 E6 3B 17 08 A0 2E C0 AE 48.h.j....;......H <truncated>
Forwardable Ticket true Forwarded Ticket false Proxiable Ticket false Proxy Ticket false Postdated Ticket false Renewable Ticket false Initial Ticket false Auth Time = Fri Jul 21 04:43:36 UTC 2017 Start Time = Fri Jul 21 04:56:34 UTC 2017 End Time = Fri Jul 21 14:43:36 UTC 2017 Renew Till = null Client AddressesNull
>>> KrbApReq: APOptions are 00100000 00000000 00000000 00000000 >>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType Krb5Context setting mySeqNumber to: 160498068 Created InitSecContextToken:
0000: 01 00 6E 82 02 62 30 82 02 5E A0 03 02 01 05 A1..n..b0..^...... 0010: 03 02 01 0E A2 07 03 05 00 20 00 00 00 A3 82 01......... ...... 0020: 6E 61 82 01 6A 30 82 01 66 A0 03 02 01 05 A1 0Cna..j0..f....... <truncated>
ls: Failed on local exception: java.io.IOException: Couldn't setup connection for user1@EXAMPLE.COM to namenode1001.example.com/192.168.1.232:8020; Host Details : local host is: "datanode1001.example.com/192.168.1.182"; destination host is: "namenode1001.example.com":8020;
user1@datanode1001:~$ klist
Ticket cache: FILE:/tmp/krb5cc_1993 Default principal: user1@EXAMPLE.COM Valid starting Expires . Service principal 07/20/2017 07:16:1307/20/2017 17:16:13krbtgt/EXAMPLE.COM@EXAMPLE.COMrenew until 07/27/2017 07:16:08
On namenode1001.example.com I see the following WARNING messages
2017-07-21 06:03:06,446 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 192.168.1.182:45320:null (Failure to initialize security context) 2017-07-21 06:03:06,446 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 192.168.1.182:45320:null (Failure to initialize security context) 2017-07-21 06:03:06,446 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8020: readAndProcess from client 192.168.1.182 threw exception [javax.security.sasl.SaslException: Failure to initialize security context [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails)]]
This issue will be fixed if I restart name node service on namenode1001.example.com. I had to do this every day first thing in the morning to fix the issue. Could this related to tgt renewal.
Created 07-22-2017 05:55 PM
Hello @Mazin Mohammed,
Yes, this looks like to be related to TGT renewal for NameNode credential. That's why it starts working when you restart (forces to get new NN credential). The quickest way to check would be to get a ticket for nn/<host> principal and do klist like this:
# kinit -kt <nn.service.keytab> nn/<host> # klist -eaf
The output above should give us some pointers. Please post the output here for all of us to see.
Hope this helps!
Created 07-24-2017 03:30 AM
Thanks for reply,
On datanode1001 I dont have the nn.service.keytab (Ambari did not deploy), thus I have executed kinit on namenode1001 itself.
user1@namenode1001:~$ sudo -u hdfs kinit -kt /etc/security/keytabs/nn.service.keytab nn/namenode1001.example.com@EXAMPLE.COM user1@namenode1001:~$ sudo klist -eaf Ticket cache: FILE:/tmp/krb5cc_0Default principal: nn/namenode1001.example.com@EXAMPLE.COM Valid starting ExpiresService principal 07/19/2017 06:30:27 07/19/2017 16:30:27 krbtgt/EXAMPLE.COM@EXAMPLE.COM renew until 07/20/2017 06:30:27, Flags: FPRIA Etype (skey, tkt): aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96 Addresses: (none)
Created 07-22-2017 06:21 PM
Yes, it's related to Kerberos ticket issue, you can find the same in your Ambari->Kerberos->Kerberos Conf Template
[libdefaults]
renew_lifetime = 7d
forwardable = true default_realm = {{realm}}
ticket_lifetime = 24hr
Possible solutions:
1) You need to work with AD team and change the password expire policy. Ensure that NN & other Ambari related principle should never expire
Created 07-24-2017 03:45 AM
On KDC we did set for all NN and Ambari related principle never expires here is the snapshot of the same.
kadmin:getprinc nn/namenode1001.example.com@EXAMPLE.COM Principal: nn/namenode1001.example.com@EXAMPLE.COM Expiration date: [never] Password expiration date: [none] Maximum ticket life: 0 days 10:00:00 Maximum renewable life: 7 days 00:00:00
Created 07-23-2017 09:00 PM
Next time remember to upload the - OS type and version
- Ambari and HDP version
- Cluster size
On the KDC server under [REALMS] /etc/krb5kdc/kdc.conf
set the max_life = 14h 0m 0s (2 weeks)
Having said that did you copy the /etc/krb5.conf to all the node?
Then on the clients check that the krb5.conf ,you should have renew_lifetime = 7d in the krb5.conf this will ,mean you kerberos ticket is valid 7 days and normally it should auto renew
Hope that helps
Created 07-24-2017 03:39 AM
OS - Ubuntu 14.04 LTS trusty Ambari version - 2.4.20 HDP version - 2.2.4 Cluster size - 8 node (staging cluster)
Yes, I did copy krb5.conf to all the nodes. renew_lifetime = 7d (yes this is set).
Does this mean irrespective of what you set, ticket will get expired and we need to restart the service to renew the ticket ?
Created 07-24-2017 06:35 AM
How about the ntp setting across the cluster?
Can you run the below commands substituting with the correct values. First run the klit -kt to get the principal for namenode ...
# klist -kt /etc/security/keytabs/nn.service.keytab Keytab name: FILE:/etc/security/keytabs/nn.service.keytab KVNO Timestamp Principal ---- ------------------- ------------------------------------------------------ 1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM 1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM 1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM 1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM 1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM # kinit -kt /etc/security/keytabs/nn.service.keytab nn/my_fdqn.com@REALM.COM # klist Ticket cache: FILE:/tmp/krb5cc_0 Default principal: nn/my_fdqn.com@REALM.COM Valid starting Expires Service principal 07/24/2017 06:30:56 07/25/2017 06:30:56 krbtgt/REALM.COM@REALM.COM
Created 07-24-2017 08:31 AM
user1@namenode1001:~$ sudo klist -kt /etc/security/keytabs/nn.service.keytab Keytab name: FILE:/etc/security/keytabs/nn.service.keytab KVNO Timestamp Principal ---- ------------------- ------------------------------------------------------ 10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM 10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM 10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM 10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM 10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM user1@namenode1001:~$ kinit -kt /etc/security/keytabs/nn.service.keytab nn/namenode1001.example.com@EXAMPLE.COM
user1@namenode1001:~$ klist Ticket cache: FILE:/tmp/krb5cc_0 Default principal: nn/namenode1001.example.com@EXAMPLE.COM Valid starting ExpiresService principal 07/24/2017 08:28:50 07/24/2017 18:28:50 krbtgt/EXAMPLE.COM@EXAMPLE.COM renew until 07/31/2017 08:28:49
Created 07-24-2017 12:46 PM
Your kerberos ticket should now expire in 7 days 07/31/2017 08:28:49 can you monitor to see if the nn again goes down?