Support Questions

Find answers, ask questions, and share your expertise

Namenode needs restart everyday after setting up kerberos

Team,

We have enabled Kerberos on our staging cluster. Its noticed that everyday morning when we try to execute hadoop command like hadoop fs -ls / it fails with error. To fix the following we need to restart the namenodes.

user1@datanode1001:~$ hadoop fs -ls /

<Truncated>
Found ticket for user1@EXAMPLE.COM to go to krbtgt/EXAMPLE.COM@EXAMPLE.COM expiring on Fri Jul 21 14:43:36 UTC 2017
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for user1@EXAMPLE.COM to go to krbtgt/EXAMPLE.COM@EXAMPLE.COMexpiring on Fri Jul 21 14:43:36 UTC 2017
Found ticket for user1@EXAMPLE.COM to go to nn/namenode1001.example.com@EXAMPLE.COM expiring on Fri Jul 21 14:43:36 UTC 2017
Found ticket for user1@EXAMPLE.COM to go to nn/namenode1002.example.com@EXAMPLE.COM expiring on Fri Jul 21 14:43:36 UTC 2017
Found service ticket in the subjectTicket (hex) =
0000: 61 82 01 6A 30 82 01 66 A0 03 02 01 05 A1 0C 1Ba..j0..f........
0010: 0A 49 4E 4D 4F 42 49 2E 43 4F 4D A2 31 30 2F A0.EXAMPLE.COM.10/.
<trucated>Client Principal = user1@EXAMPLE.COM
Server Principal = nn/namenode1001.example.com@EXAMPLE.COM
Session Key = EncryptionKey: keyType=18 keyBytes (hex dump)=
0000: 23 68 3A EA 16 21 D4 B9 31 0E C6 F8 C6 39 8D 4E#h:..!..1....9.N
0010: 99 68 B8 6A C5 90 E9 E6 3B 17 08 A0 2E C0 AE 48.h.j....;......H <truncated>
Forwardable Ticket true
Forwarded Ticket false
Proxiable Ticket false
Proxy Ticket false
Postdated Ticket false
Renewable Ticket false
Initial Ticket false
Auth Time = Fri Jul 21 04:43:36 UTC 2017
Start Time = Fri Jul 21 04:56:34 UTC 2017
End Time = Fri Jul 21 14:43:36 UTC 2017
Renew Till = null
Client AddressesNull
>>> KrbApReq: APOptions are 00100000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
Krb5Context setting mySeqNumber to: 160498068
Created InitSecContextToken:
0000: 01 00 6E 82 02 62 30 82 02 5E A0 03 02 01 05 A1..n..b0..^......
0010: 03 02 01 0E A2 07 03 05 00 20 00 00 00 A3 82 01......... ......
0020: 6E 61 82 01 6A 30 82 01 66 A0 03 02 01 05 A1 0Cna..j0..f.......
<truncated>
ls: Failed on local exception: java.io.IOException: Couldn't setup connection for user1@EXAMPLE.COM to namenode1001.example.com/192.168.1.232:8020; Host Details : local host is: "datanode1001.example.com/192.168.1.182"; destination host is: "namenode1001.example.com":8020;
user1@datanode1001:~$ klist 
Ticket cache: FILE:/tmp/krb5cc_1993
Default principal: user1@EXAMPLE.COM
Valid starting    Expires . Service principal
07/20/2017 07:16:1307/20/2017 17:16:13krbtgt/EXAMPLE.COM@EXAMPLE.COMrenew until 07/27/2017 07:16:08

On namenode1001.example.com I see the following WARNING messages

2017-07-21 06:03:06,446 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 192.168.1.182:45320:null (Failure to initialize security context)
2017-07-21 06:03:06,446 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 192.168.1.182:45320:null (Failure to initialize security context)
2017-07-21 06:03:06,446 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8020: readAndProcess from client 192.168.1.182 threw exception [javax.security.sasl.SaslException: Failure to initialize security context [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails)]]

This issue will be fixed if I restart name node service on namenode1001.example.com. I had to do this every day first thing in the morning to fix the issue. Could this related to tgt renewal.

17 REPLIES 17

Guru

Hello @Mazin Mohammed,

Yes, this looks like to be related to TGT renewal for NameNode credential. That's why it starts working when you restart (forces to get new NN credential). The quickest way to check would be to get a ticket for nn/<host> principal and do klist like this:

# kinit -kt <nn.service.keytab> nn/<host>
# klist -eaf

The output above should give us some pointers. Please post the output here for all of us to see.

Hope this helps!

Thanks for reply,

On datanode1001 I dont have the nn.service.keytab (Ambari did not deploy), thus I have executed kinit on namenode1001 itself.

user1@namenode1001:~$ sudo -u hdfs kinit -kt /etc/security/keytabs/nn.service.keytab nn/namenode1001.example.com@EXAMPLE.COM

user1@namenode1001:~$ sudo klist -eaf
Ticket cache: FILE:/tmp/krb5cc_0Default principal: nn/namenode1001.example.com@EXAMPLE.COM
Valid starting       ExpiresService       principal
07/19/2017 06:30:27  07/19/2017 16:30:27  krbtgt/EXAMPLE.COM@EXAMPLE.COM
renew until 07/20/2017 06:30:27, Flags: FPRIA
Etype (skey, tkt): aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96
Addresses: (none)

Yes, it's related to Kerberos ticket issue, you can find the same in your Ambari->Kerberos->Kerberos Conf Template

[libdefaults]

renew_lifetime = 7d

forwardable = true default_realm = {{realm}}

ticket_lifetime = 24hr

Possible solutions:

1) You need to work with AD team and change the password expire policy. Ensure that NN & other Ambari related principle should never expire

On KDC we did set for all NN and Ambari related principle never expires here is the snapshot of the same.

kadmin:getprinc nn/namenode1001.example.com@EXAMPLE.COM
Principal: nn/namenode1001.example.com@EXAMPLE.COM
Expiration date: [never]
Password expiration date: [none]
Maximum ticket life: 0 days 10:00:00
Maximum renewable life: 7 days 00:00:00

Mentor

@Mazin Mohammed

Next time remember to upload the - OS type and version

- Ambari and HDP version

- Cluster size

On the KDC server under [REALMS] /etc/krb5kdc/kdc.conf

set the max_life = 14h 0m 0s (2 weeks)

Having said that did you copy the /etc/krb5.conf to all the node?

Then on the clients check that the krb5.conf ,you should have renew_lifetime = 7d in the krb5.conf this will ,mean you kerberos ticket is valid 7 days and normally it should auto renew

Hope that helps

OS - Ubuntu 14.04 LTS trusty Ambari version - 2.4.20 HDP version - 2.2.4 Cluster size - 8 node (staging cluster)

Yes, I did copy krb5.conf to all the nodes. renew_lifetime = 7d (yes this is set).

Does this mean irrespective of what you set, ticket will get expired and we need to restart the service to renew the ticket ?

Mentor

@Mazin Mohammed

How about the ntp setting across the cluster?

Can you run the below commands substituting with the correct values. First run the klit -kt to get the principal for namenode ...

# klist -kt /etc/security/keytabs/nn.service.keytab
Keytab name: FILE:/etc/security/keytabs/nn.service.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM
   1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM
   1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM
   1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM
   1 07/18/2017 08:49:43 nn/my_fdqn.com@REALM.COM
# kinit -kt /etc/security/keytabs/nn.service.keytab nn/my_fdqn.com@REALM.COM
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: nn/my_fdqn.com@REALM.COM
Valid starting       Expires              Service principal
07/24/2017 06:30:56  07/25/2017 06:30:56  krbtgt/REALM.COM@REALM.COM

user1@namenode1001:~$ sudo klist -kt /etc/security/keytabs/nn.service.keytab 

Keytab name: FILE:/etc/security/keytabs/nn.service.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
  10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM
  10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM
  10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM
  10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM
  10 07/19/2017 07:54:02 nn/namenode1001.example.com@EXAMPLE.COM

user1@namenode1001:~$ kinit -kt /etc/security/keytabs/nn.service.keytab nn/namenode1001.example.com@EXAMPLE.COM
user1@namenode1001:~$ klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: nn/namenode1001.example.com@EXAMPLE.COM
Valid starting       ExpiresService       principal
07/24/2017 08:28:50  07/24/2017 18:28:50  krbtgt/EXAMPLE.COM@EXAMPLE.COM
renew until 07/31/2017 08:28:49

Mentor

@Mazin Mohammed

Your kerberos ticket should now expire in 7 days 07/31/2017 08:28:49 can you monitor to see if the nn again goes down?

Yes, I am seeing the error now which will be solved if namenode restars.

@Mazin Mohamme

for me it seems your ticket is going expire again in 24 hr."ExpiresService 07/24/2017 18:28:50"

I think you need to change the ticket_lifetime = 7d as well.

Mentor

@Divakar Annapureddy

No the expiry date is "renew until 07/31/2017 08:28:49"

07/24/2017 08:28:50  07/24/2017 18:28:50  krbtgt/EXAMPLE.COM@EXAMPLE.COM

renew until 07/31/2017 08:28:49

So it evident that

- ticket will get expired and when it expires then service needed to restart. Isn't this a costly operation where you need to restart namenode every week or daily depending on ticket_lifetime.

Is there a way to get this auto renewed or never_expire.

Mentor

@@Mazin Mohammed

Any updates on this issue?

issue still persists 🙂 Thanks

@Vipin Rathor @Geoffrey Shelton Okot I think by giving too many information I might have caused confusion :). Here is the exact traceback of what I do and where I end up with error.

mazin.mohammed@gateway1005:~$ hadoop fs -ls /
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>>KinitOptions cache name is /tmp/krb5cc_1993
ls: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "gateway1005.example.com/192.168.141.182"; destination host is: "namenode1002.example.com":8020;

The above error clearly mention user mazin.mohammed tgt is expired. I do a kinit to renew my ticket

mazin.mohammed@gateway1005:~$ kinitPassword for mazin.mohammed@EXAMPLE.COM:

KDC - Server Log

Jul 28 05:23:15 kdc1000 krb5kdc[959]: AS_REQ (6 etypes {18 17 16 23 25 26}) 192.168.141.182: NEEDED_PREAUTH: mazin.mohammed@EXAMPLE.COM for krbtgt/EXAMPLE.COM@EXAMPLE.COM, Additional pre-authentication required
Jul 28 05:23:34 kdc1000 krb5kdc[959]: AS_REQ (6 etypes {18 17 16 23 25 26}) 192.168.141.182: ISSUE: authtime 123456, etypes {rep=18 tkt=18 ses=18}, mazin.mohammed@EXAMPLE.COM for krbtgt/EXAMPLE.COM@EXAMPLE.COM
mazin.mohammed@gateway1005:~$ klist 
Ticket cache: FILE:/tmp/krb5cc_1993
Default principal: mazin.mohammed@EXAMPLE.COM
Valid starting       ExpiresService       principal
07/28/2017 05:23:34  07/28/2017 15:23:34  krbtgt/EXAMPLE.COM@EXAMPLE.COM
renew until 08/04/2017 05:23:15

Now I try to access the cluster via hadoop command and this is where it ends up in the error even though I have valid TGT present in my cache.

mazin.mohammed@gateway1005:~$ hadoop fs -ls /
Found ticket for mazin.mohammed@EXAMPLE.COM to go to krbtgt/EXAMPLE.COM@EXAMPLE.COM expiring on Fri Jul 28 15:23:34 UTC 2017
Found ticket for mazin.mohammed@EXAMPLE.COM to go to nn/namenode1001.example.com@EXAMPLE.COM expiring on Fri Jul 28 15:23:34 UTC 2017

ls: Failed on local exception: java.io.IOException: Couldn't setup connection for mazin.mohammed@EXAMPLE.COM to namenode1001.example.com/192.168.140.232:8020; Host Details : local host is: "gateway.example.com/192.168.141.182"; destination host is: "namenode1001.example.com":8020;

KDC server.log

Jul 28 05:24:15 infosec1000 krb5kdc[959]: TGS_REQ (4 etypes {18 17 16 23}) 192.168.141.182: ISSUE: authtime 1501219414, etypes {rep=18 tkt=18 ses=18}, mazin.mohammed@EXAMPLE.COM for nn/namenode1001.example.com@EXAMPLE.COM
Jul 28 05:24:18 infosec1000 krb5kdc[959]: TGS_REQ (4 etypes {18 17 16 23}) 192.168.141.182: ISSUE: authtime 1501219414, etypes {rep=18 tkt=18 ses=18}, mazin.mohammed@EXAMPLE.COM for nn/namenode1001.example.com@EXAMPLE.COM

Namenode log

2017-07-28 06:20:52,151 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 192.168.141.182:58192:null (Failure to initialize security context)
2017-07-28 06:20:52,151 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 192.168.141.182:58192:null (Failure to initialize security context)
2017-07-28 06:20:52,151 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8020: readAndProcess from client 192.168.141.182 threw exception [javax.security.sasl.SaslException: Failure to initialize security context [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails)]]

In this case I am the client and I do have a TGT and service ticket issued by TGS to access hadoop cluster but namenode is refusing by throwing an exception threw exception [javax.security.sasl.SaslException: Failure to initialize security context [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails)]].

As said in the thread this issue is solved if I restart the namenode. I hope this will give some more insights to the issue we are facing.

Guru

Hello @Mazin Mohammed,

The "Failed to find any Kerberos credentials" error in Namenode log, could also mean wrong / bad Kerberos credential for Namenode process. It does not necessarily always mean client credential.

Please check during NameNode startup that if the daemon was able to use nn.service.keytab and spnego.service.keytab both to secure a kerberos credential for itself. You also might want to enable Kerberos debug for NameNode by adding "-Dsun.security.krb5.debug=true" to Namenode command-line argument list.

Let us know. Happy hunting !

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.