ZKFC fails to start, apparently due to expired Kerberos tickets (both NameNodes stuck in standby mode)

New Contributor

After restarting a working HDP cluster through Ambari, both NameNodes entered standby mode.

I also noticed that the ZKFailoverControllers failed to come up.

Then, I performed these steps (through Ambari):

  1. stopped all services
  2. started only the ZooKeeper service (3 servers: dfcdsrvbi0042, dfcdsrvbi0043, dfcdsrvbi0045)
  3. started all JournalNodes (3 servers: dfcdsrvbi0042, dfcdsrvbi0043, dfcdsrvbi0045)

To start ZKFC manually, I ran the command below (as the "hdfs" user on the NameNode "dfcdsrvbi0042", which I elected to be the primary):

[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh start zkfc

It did not succeed, producing the following entries in the log (/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-dfcdsrvbi0042.localdomain.log):

STARTUP_MSG: Starting DFSZKFailoverController
STARTUP_MSG:   user = hdfs
STARTUP_MSG:   host = dfcdsrvbi0042.localdomain/161.148.242.43
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.3.2.6.0.3-8
...
2018-05-29 15:18:02,724 INFO  tools.DFSZKFailoverController (LogAdapter.java:info(45)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-05-29 15:18:03,148 INFO  tools.DFSZKFailoverController (DFSZKFailoverController.java:<init>(153)) - Failover controller configured for NameNode NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,476 INFO  security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(1101)) - Login successful for user nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR using keytab file /etc/security/keytabs/nn.service.keytab
2018-05-29 15:18:03,505 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=dfcdsrvbi0042.localdomain:2181,dfcdsrvbi0043.localdomain:2181,dfcdsrvbi0045.localdomain:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@76b1e9b8
2018-05-29 15:18:03,527 INFO  zookeeper.Login (Login.java:login(294)) - successfully logged in.
2018-05-29 15:18:03,529 INFO  zookeeper.Login (Login.java:run(127)) - TGT refresh thread started.
2018-05-29 15:18:03,531 INFO  client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL mechanism.
2018-05-29 15:18:03,537 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server dfcdsrvbi0045.localdomain/161.148.242.46:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2018-05-29 15:18:03,538 INFO  zookeeper.Login (Login.java:getRefreshTime(302)) - TGT valid starting at:        Tue May 29 15:18:03 BRT 2018
2018-05-29 15:18:03,538 INFO  zookeeper.Login (Login.java:getRefreshTime(303)) - TGT expires:                  Sun Feb 07 04:28:15 BRST 2106
2018-05-29 15:18:03,539 INFO  zookeeper.Login (Login.java:run(181)) - TGT refresh sleeping until: Fri Jan 25 11:27:36 BRST 2092
2018-05-29 15:18:03,543 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to dfcdsrvbi0045.localdomain/161.148.242.46:2181, initiating session
2018-05-29 15:18:03,550 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server dfcdsrvbi0045.localdomain/161.148.242.46:2181, sessionid = 0x363acc9fc850003, negotiated timeout = 6000
2018-05-29 15:18:03,552 INFO  ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(600)) - Session connected.
2018-05-29 15:18:03,559 ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(388)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,559 ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1059)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,665 FATAL ha.ActiveStandbyElector (ActiveStandbyElector.java:fatalError(695)) - Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO  ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK connection for elector id=2075809815 appData=null cb=Elector callbacks for NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,665 FATAL ha.ZKFailoverController (ZKFailoverController.java:fatalError(374)) - Fatal error occurred:Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO  ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(832)) - terminateConnection, zkConnectionState = TERMINATED
2018-05-29 15:18:03,666 FATAL tools.DFSZKFailoverController (DFSZKFailoverController.java:main(193)) - Got a fatal error, exiting now
java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/MYHDPNNHA'
        at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:319)
        at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:218)
        at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
        at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1846)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:474)
        at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
        at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:191)
Caused by: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hadoop-ha/MYHDPNNHA
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
        at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:317)
        ... 10 more
2018-05-28 15:50:12,167 INFO  tools.DFSZKFailoverController (LogAdapter.java:info(45)) - SHUTDOWN_MSG: 
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at dfcdsrvbi0042.localdomain/161.148.242.43
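
(For reference, the corresponding server-side SASL error can be looked up on the ZooKeeper server the client connected to, dfcdsrvbi0045; the log path below is the usual HDP default and may differ:

# on dfcdsrvbi0045, as root (adjust the log directory if ZooKeeper logs elsewhere)
grep -i "sasl\|gss" /var/log/zookeeper/*.log | tail -n 20

This usually shows the matching GSS/SASL failure from the server's point of view.)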

I even tried to reset the corresponding ZKFC data in ZooKeeper with the command below, but the same error occurred:

[hdfs@dfcdsrvbi0042 ~]$ hdfs zkfc -formatZK -force

In the KDC log (/var/log/krb5kdc.log on the Kerberos server), I noted these entries:

May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: PROCESS_TGS: authtime 0,  nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR, Ticket expired

Then I tried to manually renew the tickets by executing the following (as the "hdfs" user on the same NameNode, "dfcdsrvbi0042"):

kdestroy
kinit -V -kt /etc/security/keytabs/nn.service.keytab nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
kvno zookeeper/dfcdsrvbi0042.localdomain
kvno zookeeper/dfcdsrvbi0043.localdomain
kvno zookeeper/dfcdsrvbi0045.localdomain
klist

Here is the output of "klist":

[hdfs@dfcdsrvbi0042 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1031
Default principal: nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR

Valid starting       Expires              Service principal
05/29/2018 16:35:21  05/30/2018 16:35:21  krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR

Those later commands were reflected in the KDC log:

May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (8 etypes {18 17 16 23 20 19 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR
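
(For reference, a related check that rules out a stale ZooKeeper service keytab is to decrypt a freshly issued service ticket against the keytab itself; the keytab paths below are the usual HDP defaults and may differ:

# run on dfcdsrvbi0042, which hosts both a NameNode and a ZooKeeper server
klist -kte /etc/security/keytabs/zk.service.keytab
kinit -kt /etc/security/keytabs/nn.service.keytab nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
kvno -k /etc/security/keytabs/zk.service.keytab zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR

kvno -k decrypts the acquired service ticket with the local keytab, so a KVNO or enctype mismatch would show up immediately.)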

I also tried modifying "zookeeper_client_jaas.conf" (in /etc/zookeeper/2.6.0.3-8/0/ on the same server, "dfcdsrvbi0042") with these settings:

Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=true
keyTab="/etc/security/keytabs/zk.service.keytab"
principal="zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR";
};

But nothing new happened.
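
(One thing worth checking here: on HDP the ZKFC daemon does not necessarily read /etc/zookeeper/.../zookeeper_client_jaas.conf; it is usually pointed at a JAAS file under the Hadoop configuration directory via HADOOP_ZKFC_OPTS in hadoop-env.sh. A quick way to confirm which JAAS file it actually loads, assuming the common HDP default paths:

grep "java.security.auth.login.config" /usr/hdp/current/hadoop-client/conf/hadoop-env.sh
ls -l /usr/hdp/current/hadoop-client/conf/*jaas*.conf

Whatever file that turns out to be, its Client { } section is the one ZKFC uses for its SASL login to ZooKeeper, and it should reference a keytab and principal that actually exist on this host.)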

I could manually connect to the ZooKeeper service and inspect the "/hadoop-ha/MYHDPNNHA" znode that ZKFC was supposed to read:

[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/2.6.0.3-8/zookeeper/bin/zkCli.sh -server dfcdsrvbi0042.localdomain

[zk: dfcdsrvbi0042.localdomain(CONNECTED) 1] ls /
[hive, registry, zookeeper, infra-solr, hiveserver2-hive2, yarn-leader-election, hadoop-ha, services, llap-sasl, templeton-hadoop, zkdtsm_mp1_llap0, zkdtsm_hive_llap0, hiveserver2, rmstore, ambari-metrics-cluster, zkdtsm_04817003669_llap0]

[zk: dfcdsrvbi0042.localdomain(CONNECTED) 7] stat /hadoop-ha/MYHDPNNHA                 
cZxid = 0x200000007
ctime = Fri Apr 14 17:32:03 BRT 2017
mZxid = 0x200000007
mtime = Fri Apr 14 17:32:03 BRT 2017
pZxid = 0x2f0003c87a
cversion = 345
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 1

At this point I have run out of ideas on how to restore my HDFS.

What else should I do?

Any help is appreciated.

Thanks in advance!

6 REPLIES

Master Mentor

@Rodrigo Hjort

Have you ever restarted your cluster after kerberization?

The assumption here is that your OS is RHEL/CentOS/Oracle Linux 6; if not, please adapt the commands.

Can you share the contents of

  • krb5.conf
  • kdc.conf
  • kadm5.acl

Make sure the Kerberos daemons are up and running

# service kadmin status
# service krb5kdc status

If down

# service kadmin start 
# service krb5kdc start 
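
If your OS is actually a systemd-based release (RHEL/CentOS/Oracle Linux 7), the equivalent commands on the KDC host would be:

# systemctl status krb5kdc kadmin
# systemctl start krb5kdc kadmin
# systemctl enable krb5kdc kadmin

(the enable step just makes sure they come back after a reboot)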

Make sure NTPD is running so the time in your cluster is in sync.

To sum up, the ticket lifetime is the minimum of the following values:

  • max_life in kdc.conf on the KDC servers.
  • ticket_lifetime in krb5.conf on the client machine.
  • maxlife for the user principal.
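
A quick way to check all three on the KDC host (the principals below are just examples taken from your logs, and the paths are the stock RHEL locations):

# grep max_life /var/kerberos/krb5kdc/kdc.conf
# grep ticket_lifetime /etc/krb5.conf
# kadmin.local -q "getprinc krbtgt/MYHDP.COM.BR@MYHDP.COM.BR" | grep -i life
# kadmin.local -q "getprinc nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR" | grep -i life

The effective ticket lifetime can never exceed the smallest of these values.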

Please revert



New Contributor

@Geoffrey Shelton Okot

> Have you ever restarted your cluster after kerberization?

Yes. I even restarted the operating system on all nodes.

> The assumption here is that your OS is RHEL/CentOS/Oracle Linux 6; if not, please adapt the commands.

It's CentOS Linux release 7.3.1611 (Core).

Here are the contents of Kerberos configuration files:

1) /etc/krb5.conf

[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = MYHDP.COM.BR
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[domain_realm]
  MYHDP.COM.BR = MYHDP.COM.BR

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  MYHDP.COM.BR = {
    admin_server = 10.30.242.46
    kdc = 10.30.242.46
  }

2) /var/kerberos/krb5kdc/kdc.conf

[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 MYHDP.COM.BR = {
  #master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
 }

3) /var/kerberos/krb5kdc/kadm5.acl

*/admin@MYHDP.COM.BR *

And here are the Kerberos services statuses:

4) kadmin

[root@dfcdsrvbi0045 ~]# service kadmin status
Redirecting to /bin/systemctl status kadmin.service
● kadmin.service - Kerberos 5 Password-changing and Administration
   Loaded: loaded (/usr/lib/systemd/system/kadmin.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
 Main PID: 1998 (kadmind)
   CGroup: /system.slice/kadmin.service
           └─1998 /usr/sbin/kadmind -P /var/run/kadmind.pid

5) krb5kdc

[root@dfcdsrvbi0045 ~]# service krb5kdc status
Redirecting to /bin/systemctl status krb5kdc.service
● krb5kdc.service - Kerberos 5 KDC
   Loaded: loaded (/usr/lib/systemd/system/krb5kdc.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
 Main PID: 1950 (krb5kdc)
   CGroup: /system.slice/krb5kdc.service
           └─1950 /usr/sbin/krb5kdc -P /var/run/krb5kdc.pid

> Make sure NTPD is running so the time in your cluster is in sync.

I've already manually synchronized the clocks on all cluster nodes against a given NTP server using "ntpdate" (see the output below).

Moreover, I haven't found any "clock skew too great" messages in the KDC log files.

[root@dfcdsrvbi0045 ~]# ./ntpdate-all.sh
Command: ntpdate -u dfcddom0002.otherdomain

[10.30.242.11]
30 May 14:59:46 ntpdate[46569]: adjust time server 10.30.14.150 offset 0.017012 sec
[10.30.242.12]
30 May 14:59:53 ntpdate[8593]: adjust time server 10.30.14.150 offset 0.012909 sec
[10.30.242.13]
30 May 14:59:59 ntpdate[64344]: adjust time server 10.30.14.151 offset 0.025877 sec
[10.30.242.22]
30 May 15:00:06 ntpdate[8161]: adjust time server 10.30.14.150 offset 0.001915 sec
[10.30.242.24]
30 May 15:00:12 ntpdate[79351]: adjust time server 10.30.14.150 offset 0.025769 sec
[10.30.242.25]
30 May 15:00:19 ntpdate[43070]: adjust time server 10.30.14.150 offset 0.011081 sec
[10.30.242.28]
30 May 15:00:26 ntpdate[76036]: adjust time server 10.30.14.150 offset 0.016353 sec
[10.30.242.43]
30 May 15:00:32 ntpdate[25763]: adjust time server 10.30.14.151 offset 0.007244 sec
[10.30.242.44]
30 May 15:00:39 ntpdate[16081]: adjust time server 10.30.14.151 offset -0.001653 sec
[10.30.242.45]
30 May 15:00:45 ntpdate[9834]: adjust time server 10.30.14.150 offset 0.012155 sec
[10.30.242.46]
30 May 15:00:51 ntpdate[10009]: adjust time server 10.30.14.151 offset 0.011018 sec
[10.30.242.50]
30 May 15:00:58 ntpdate[63572]: adjust time server 10.30.14.150 offset 0.005541 sec
[10.30.242.51]
30 May 15:01:05 ntpdate[24894]: adjust time server 10.30.14.151 offset -0.008577 sec
[10.30.242.53]
30 May 15:01:05 ntpdate[24813]: adjust time server 10.30.14.150 offset 0.024188 sec

The rest of the HDFS components (JournalNodes, DataNodes, and NameNodes) appear to be functional, as shown in the screenshot below:

[screenshot: 77459-hortonworks-community-zkfc-ambari-968u.png]

The problem is that both NameNodes are stuck in "standby" mode and the ZKFailoverControllers are offline.

Therefore, I can't start the other cluster services that depend on HDFS (e.g., YARN, Hive, etc.).

Any clue?

Master Mentor

@Rodrigo Hjort

The files look okay except for krb5.conf. Notice the [domain_realm] entry: please change it as shown below on the KDC, copy the file to all the nodes, and restart the KDC services (kadmin & krb5kdc). Such a simple thing could be the problem.

[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = MYHDP.COM.BR
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[domain_realm]
  .myhdp.com.br = MYHDP.COM.BR
  myhdp.com.br = MYHDP.COM.BR

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  MYHDP.COM.BR = {
    admin_server = 10.30.242.46
    kdc = 10.30.242.46
  }
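
A minimal way to push the corrected file out and bounce the KDC (the host list is just an example; include every node of the cluster):

# on the KDC host, after editing /etc/krb5.conf
for h in dfcdsrvbi0042 dfcdsrvbi0043 dfcdsrvbi0045; do
    scp /etc/krb5.conf ${h}.localdomain:/etc/krb5.conf
done
systemctl restart krb5kdc kadmin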

How are your /etc/hosts entries? Make sure there are no mixed-case hostnames.
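
For example, a clean entry looks like this (FQDN first, then the short name, all lowercase; the IPs are taken from the log excerpts above):

161.148.242.43   dfcdsrvbi0042.localdomain   dfcdsrvbi0042
161.148.242.46   dfcdsrvbi0045.localdomain   dfcdsrvbi0045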

Why are Ranger, YARN, MR, and AMS disabled/in maintenance mode? Had you enabled the Ranger plugins?

Please revert

Expert Contributor

This could be a KDC server issue. Please check whether you are hitting this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1560951
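
If you suspect that, a quick way to rule it in or out on the KDC host (stock CentOS 7 commands) is to compare the installed krb5 build with the fixed versions listed in the bugzilla, update if needed, and restart the KDC:

# rpm -q krb5-server krb5-libs
# yum update krb5-server krb5-libs
# systemctl restart krb5kdc kadmin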

Master Mentor

@Rodrigo Hjort

Any updates on this thread? It should have been closed long ago. Please take a moment to accept a provided solution.

Guru

Hi @Rodrigo Hjort,

Did you solve this problem, and if so, how?