Created 05-30-2018 11:18 AM
After restarting a working HDP cluster through Ambari, both NameNodes came up in standby mode.
I also noticed that the ZKFailoverControllers failed to start.
Then, I performed these steps (through Ambari):
In order to start ZKFC manually, I tried the command below (as the "hdfs" user on the NameNode "dfcdsrvbi0042", which I elected to be the primary):
[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh start zkfc
It did not succeed, leaving the following in the log (/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-dfcdsrvbi0042.localdomain.log):
STARTUP_MSG: Starting DFSZKFailoverController
STARTUP_MSG: user = hdfs
STARTUP_MSG: host = dfcdsrvbi0042.localdomain/161.148.242.43
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.3.2.6.0.3-8
...
2018-05-29 15:18:02,724 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-05-29 15:18:03,148 INFO tools.DFSZKFailoverController (DFSZKFailoverController.java:<init>(153)) - Failover controller configured for NameNode NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,476 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(1101)) - Login successful for user nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR using keytab file /etc/security/keytabs/nn.service.keytab
2018-05-29 15:18:03,505 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=dfcdsrvbi0042.localdomain:2181,dfcdsrvbi0043.localdomain:2181,dfcdsrvbi0045.localdomain:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@76b1e9b8
2018-05-29 15:18:03,527 INFO zookeeper.Login (Login.java:login(294)) - successfully logged in.
2018-05-29 15:18:03,529 INFO zookeeper.Login (Login.java:run(127)) - TGT refresh thread started.
2018-05-29 15:18:03,531 INFO client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL mechanism.
2018-05-29 15:18:03,537 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server dfcdsrvbi0045.localdomain/161.148.242.46:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2018-05-29 15:18:03,538 INFO zookeeper.Login (Login.java:getRefreshTime(302)) - TGT valid starting at: Tue May 29 15:18:03 BRT 2018
2018-05-29 15:18:03,538 INFO zookeeper.Login (Login.java:getRefreshTime(303)) - TGT expires: Sun Feb 07 04:28:15 BRST 2106
2018-05-29 15:18:03,539 INFO zookeeper.Login (Login.java:run(181)) - TGT refresh sleeping until: Fri Jan 25 11:27:36 BRST 2092
2018-05-29 15:18:03,543 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to dfcdsrvbi0045.localdomain/161.148.242.46:2181, initiating session
2018-05-29 15:18:03,550 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server dfcdsrvbi0045.localdomain/161.148.242.46:2181, sessionid = 0x363acc9fc850003, negotiated timeout = 6000
2018-05-29 15:18:03,552 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(600)) - Session connected.
2018-05-29 15:18:03,559 ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(388)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,559 ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1059)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,665 FATAL ha.ActiveStandbyElector (ActiveStandbyElector.java:fatalError(695)) - Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK connection for elector id=2075809815 appData=null cb=Elector callbacks for NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,665 FATAL ha.ZKFailoverController (ZKFailoverController.java:fatalError(374)) - Fatal error occurred:Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(832)) - terminateConnection, zkConnectionState = TERMINATED
2018-05-29 15:18:03,666 FATAL tools.DFSZKFailoverController (DFSZKFailoverController.java:main(193)) - Got a fatal error, exiting now
java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/MYHDPNNHA'
at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:319)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:218)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1846)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:474)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:191)
Caused by: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hadoop-ha/MYHDPNNHA
at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:317)
... 10 more
2018-05-28 15:50:12,167 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at dfcdsrvbi0042.localdomain/161.148.242.43
I even tried to reset the corresponding ZKFC data in ZooKeeper with the command below, but the same error occurred:
[hdfs@dfcdsrvbi0042 ~]$ hdfs zkfc -formatZK -force
In the KDC log (/var/log/krb5kdc.log on the Kerberos server), I noticed these entries:
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: PROCESS_TGS: authtime 0, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR, Ticket expired
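As a sanity check, the authtime in these entries lines up with the wall-clock timestamp of the log itself, so KDC clock skew did not look like the culprit (a quick conversion; 1527622154 is copied from the entries above, and BRT is UTC-3):

```shell
# authtime copied from the TGS_REQ entry above
date -u -d @1527622154 '+%Y-%m-%d %H:%M:%S'
# -> 2018-05-29 19:29:14 (UTC), i.e. 16:29:14 BRT (UTC-3),
#    which matches the "May 29 16:29:14" stamp on the log line itself
```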
Then I tried to manually renew the tickets by executing the following (as the "hdfs" user on the same NameNode, "dfcdsrvbi0042"):
kdestroy
kinit -V -kt /etc/security/keytabs/nn.service.keytab nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
kvno zookeeper/dfcdsrvbi0042.localdomain
kvno zookeeper/dfcdsrvbi0043.localdomain
kvno zookeeper/dfcdsrvbi0045.localdomain
klist
Here were the results of "klist":
[hdfs@dfcdsrvbi0042 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1031
Default principal: nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
Valid starting       Expires              Service principal
05/29/2018 16:35:21  05/30/2018 16:35:21  krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR
Those later commands were reflected in the KDC log:
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (8 etypes {18 17 16 23 20 19 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR
I also tried modifying "zookeeper_client_jaas.conf" (in /etc/zookeeper/2.6.0.3-8/0/ at the same "dfcdsrvbi0042" server) using these settings:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=true
keyTab="/etc/security/keytabs/zk.service.keytab"
principal="zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR";
};
But nothing new happened.
I could manually connect to the ZooKeeper service and read the "/hadoop-ha/MYHDPNNHA" znode that ZKFC was supposed to access:
[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/2.6.0.3-8/zookeeper/bin/zkCli.sh -server dfcdsrvbi0042.localdomain
[zk: dfcdsrvbi0042.localdomain(CONNECTED) 1] ls /
[hive, registry, zookeeper, infra-solr, hiveserver2-hive2, yarn-leader-election, hadoop-ha, services, llap-sasl, templeton-hadoop, zkdtsm_mp1_llap0, zkdtsm_hive_llap0, hiveserver2, rmstore, ambari-metrics-cluster, zkdtsm_04817003669_llap0]
[zk: dfcdsrvbi0042.localdomain(CONNECTED) 7] stat /hadoop-ha/MYHDPNNHA
cZxid = 0x200000007
ctime = Fri Apr 14 17:32:03 BRT 2017
mZxid = 0x200000007
mtime = Fri Apr 14 17:32:03 BRT 2017
pZxid = 0x2f0003c87a
cversion = 345
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 1
Therefore, I have run out of ideas on how to restore my HDFS.
Please, what else should I do?
Any help is appreciated.
Thanks in advance!
Created 05-30-2018 03:56 PM
Have you ever restarted your cluster after kerberization?
The assumption here is that your OS is RHEL/CentOS/Oracle Linux 6; if not, please adapt the commands accordingly.
Can you share the contents of your Kerberos configuration files (krb5.conf, kdc.conf, and kadm5.acl)?
Make sure the Kerberos daemons are up and running:
# service kadmin status
# service krb5kdc status
If they are down, start them:
# service kadmin start
# service krb5kdc start
Make sure NTPD is running so the time in your cluster is in sync.
To sum up, the ticket lifetime is the minimum of the following values:
- the lifetime requested by the client (e.g. kinit -l)
- the ticket_lifetime setting in the [libdefaults] section of krb5.conf
- the max_life attribute of the client principal and of the krbtgt/REALM service principal
- the max_life setting in the realm stanza of kdc.conf
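The KDC-side maximums live in kdc.conf; a realm stanza with explicit limits (the values below are common defaults, shown only as an illustration) would look like:

```
[realms]
    MYHDP.COM.BR = {
        max_life = 24h 0m 0s
        max_renewable_life = 7d 0h 0m 0s
    }
```

Note that each principal also carries its own maxlife/maxrenewlife attributes, adjustable with kadmin's modprinc.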
Please revert
Created on 05-30-2018 06:49 PM - edited 08-17-2019 08:49 PM
> Have you ever restarted your cluster after kerberization?
Yes. I even restarted the operating system at all nodes.
> The assumption here is your OS is RHEL/CentOS/Oracle Linux 6,if not please adapt the commands
It's a CentOS Linux release 7.3.1611 (Core).
Here are the contents of Kerberos configuration files:
1) /etc/krb5.conf
[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = MYHDP.COM.BR
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
default_ccache_name = /tmp/krb5cc_%{uid}
#default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
#default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
[domain_realm]
MYHDP.COM.BR = MYHDP.COM.BR
[logging]
default = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
kdc = FILE:/var/log/krb5kdc.log
[realms]
MYHDP.COM.BR = {
admin_server = 10.30.242.46
kdc = 10.30.242.46
}
2) /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
MYHDP.COM.BR = {
#master_key_type = aes256-cts
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}
3) /var/kerberos/krb5kdc/kadm5.acl
*/admin@MYHDP.COM.BR *
And here are the Kerberos services statuses:
4) kadmin
[root@dfcdsrvbi0045 ~]# service kadmin status
Redirecting to /bin/systemctl status kadmin.service
● kadmin.service - Kerberos 5 Password-changing and Administration
Loaded: loaded (/usr/lib/systemd/system/kadmin.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
Main PID: 1998 (kadmind)
CGroup: /system.slice/kadmin.service
└─1998 /usr/sbin/kadmind -P /var/run/kadmind.pid
5) krb5kdc
[root@dfcdsrvbi0045 ~]# service krb5kdc status
Redirecting to /bin/systemctl status krb5kdc.service
● krb5kdc.service - Kerberos 5 KDC
Loaded: loaded (/usr/lib/systemd/system/krb5kdc.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
Main PID: 1950 (krb5kdc)
CGroup: /system.slice/krb5kdc.service
└─1950 /usr/sbin/krb5kdc -P /var/run/krb5kdc.pid
> Make sure NTPD is running so the time in your cluster is in sync.
I've already synced the clocks of all cluster nodes against an NTP server using "ntpdate" (see output below).
Moreover, I haven't found any "clock skew too great" messages in the KDC log files.
[root@dfcdsrvbi0045 ~]# ./ntpdate-all.sh
Command: ntpdate -u dfcddom0002.otherdomain
[10.30.242.11] 30 May 14:59:46 ntpdate[46569]: adjust time server 10.30.14.150 offset 0.017012 sec
[10.30.242.12] 30 May 14:59:53 ntpdate[8593]: adjust time server 10.30.14.150 offset 0.012909 sec
[10.30.242.13] 30 May 14:59:59 ntpdate[64344]: adjust time server 10.30.14.151 offset 0.025877 sec
[10.30.242.22] 30 May 15:00:06 ntpdate[8161]: adjust time server 10.30.14.150 offset 0.001915 sec
[10.30.242.24] 30 May 15:00:12 ntpdate[79351]: adjust time server 10.30.14.150 offset 0.025769 sec
[10.30.242.25] 30 May 15:00:19 ntpdate[43070]: adjust time server 10.30.14.150 offset 0.011081 sec
[10.30.242.28] 30 May 15:00:26 ntpdate[76036]: adjust time server 10.30.14.150 offset 0.016353 sec
[10.30.242.43] 30 May 15:00:32 ntpdate[25763]: adjust time server 10.30.14.151 offset 0.007244 sec
[10.30.242.44] 30 May 15:00:39 ntpdate[16081]: adjust time server 10.30.14.151 offset -0.001653 sec
[10.30.242.45] 30 May 15:00:45 ntpdate[9834]: adjust time server 10.30.14.150 offset 0.012155 sec
[10.30.242.46] 30 May 15:00:51 ntpdate[10009]: adjust time server 10.30.14.151 offset 0.011018 sec
[10.30.242.50] 30 May 15:00:58 ntpdate[63572]: adjust time server 10.30.14.150 offset 0.005541 sec
[10.30.242.51] 30 May 15:01:05 ntpdate[24894]: adjust time server 10.30.14.151 offset -0.008577 sec
[10.30.242.53] 30 May 15:01:05 ntpdate[24813]: adjust time server 10.30.14.150 offset 0.024188 sec
The rest of the HDFS service components (JournalNodes, DataNodes, and NameNodes) seem functional, as illustrated in the picture below:
The problem is that both NameNodes are stuck in "standby" mode and the ZKFailoverControllers are offline.
Therefore, I couldn't start the other cluster services that depend on HDFS (YARN, Hive, etc.).
Any clue?
Created 05-30-2018 07:38 PM
The files look okay except krb5.conf: notice the [domain_realm] entry. Please change it on the KDC, copy it to all the nodes, and restart the KDC services (kadmin and krb5kdc). Such a simple thing could be the problem:
[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = MYHDP.COM.BR
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
default_ccache_name = /tmp/krb5cc_%{uid}
#default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
#default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
[domain_realm]
.myhdp.com.br = MYHDP.COM.BR
myhdp.com.br = MYHDP.COM.BR
[logging]
default = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
kdc = FILE:/var/log/krb5kdc.log
[realms]
MYHDP.COM.BR = {
admin_server = 10.30.242.46
kdc = 10.30.242.46
}
How do your /etc/hosts entries look? Make sure there are no mixed-case hostnames.
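For reference, /etc/hosts entries along these lines are what the Hadoop/Kerberos stack expects — all lowercase, fully qualified name first (the IPs and hostnames below are taken from the logs earlier in this thread; the remaining nodes would follow the same pattern):

```
161.148.242.43   dfcdsrvbi0042.localdomain   dfcdsrvbi0042
161.148.242.46   dfcdsrvbi0045.localdomain   dfcdsrvbi0045
```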
Why are Ranger, YARN, MR, and AMS disabled/in maintenance mode? Have you enabled the Ranger plugins?
Please revert
Created 10-05-2018 05:41 PM
This could be a KDC server issue. Please check whether you are hitting this: https://bugzilla.redhat.com/show_bug.cgi?id=1560951
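One detail in the ZKFC log above points the same way: the TGT expiry was reported as "Sun Feb 07 04:28:15 BRST 2106", which is exactly 2^32 - 1 seconds after the Unix epoch — a telltale sign of a 32-bit overflow in ticket-lifetime handling on the KDC side (BRST is UTC-2):

```shell
# The maximum 32-bit unsigned epoch timestamp, in UTC
date -u -d @4294967295 '+%Y-%m-%d %H:%M:%S'
# -> 2106-02-07 06:28:15 (UTC), i.e. 04:28:15 BRST (UTC-2),
#    matching the "TGT expires" line in the ZKFC log
```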
Created 10-05-2018 10:52 PM
Any updates on this thread? It should have been closed long ago. Please take the time to accept a provided solution.
Created 02-28-2019 08:25 AM
Hi @Rodrigo Hjort ,
Did you solve this problem, and if so, how?