Created 05-30-2018 11:18 AM
After restarting a working HDP cluster through Ambari, both NameNodes entered standby mode.
I also noted that the ZKFailoverControllers failed to start.
Then, I performed these steps (through Ambari):
In order to start ZKFC manually, I tried the command below (as the "hdfs" user on the NameNode "dfcdsrvbi0042", which I elected to be the primary):
[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh start zkfc
It did not succeed, producing the following entries in the log (/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-dfcdsrvbi0042.localdomain.log):
STARTUP_MSG: Starting DFSZKFailoverController
STARTUP_MSG:   user = hdfs
STARTUP_MSG:   host = dfcdsrvbi0042.localdomain/161.148.242.43
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.3.2.6.0.3-8
...
2018-05-29 15:18:02,724 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-05-29 15:18:03,148 INFO tools.DFSZKFailoverController (DFSZKFailoverController.java:<init>(153)) - Failover controller configured for NameNode NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,476 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(1101)) - Login successful for user nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR using keytab file /etc/security/keytabs/nn.service.keytab
2018-05-29 15:18:03,505 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=dfcdsrvbi0042.localdomain:2181,dfcdsrvbi0043.localdomain:2181,dfcdsrvbi0045.localdomain:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@76b1e9b8
2018-05-29 15:18:03,527 INFO zookeeper.Login (Login.java:login(294)) - successfully logged in.
2018-05-29 15:18:03,529 INFO zookeeper.Login (Login.java:run(127)) - TGT refresh thread started.
2018-05-29 15:18:03,531 INFO client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL mechanism.
2018-05-29 15:18:03,537 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server dfcdsrvbi0045.localdomain/161.148.242.46:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2018-05-29 15:18:03,538 INFO zookeeper.Login (Login.java:getRefreshTime(302)) - TGT valid starting at: Tue May 29 15:18:03 BRT 2018
2018-05-29 15:18:03,538 INFO zookeeper.Login (Login.java:getRefreshTime(303)) - TGT expires: Sun Feb 07 04:28:15 BRST 2106
2018-05-29 15:18:03,539 INFO zookeeper.Login (Login.java:run(181)) - TGT refresh sleeping until: Fri Jan 25 11:27:36 BRST 2092
2018-05-29 15:18:03,543 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to dfcdsrvbi0045.localdomain/161.148.242.46:2181, initiating session
2018-05-29 15:18:03,550 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server dfcdsrvbi0045.localdomain/161.148.242.46:2181, sessionid = 0x363acc9fc850003, negotiated timeout = 6000
2018-05-29 15:18:03,552 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(600)) - Session connected.
2018-05-29 15:18:03,559 ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(388)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,559 ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1059)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,665 FATAL ha.ActiveStandbyElector (ActiveStandbyElector.java:fatalError(695)) - Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK connection for elector id=2075809815 appData=null cb=Elector callbacks for NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,665 FATAL ha.ZKFailoverController (ZKFailoverController.java:fatalError(374)) - Fatal error occurred: Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(832)) - terminateConnection, zkConnectionState = TERMINATED
2018-05-29 15:18:03,666 FATAL tools.DFSZKFailoverController (DFSZKFailoverController.java:main(193)) - Got a fatal error, exiting now
java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/MYHDPNNHA'
    at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:319)
    at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:218)
    at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1846)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:474)
    at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
    at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:191)
Caused by: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hadoop-ha/MYHDPNNHA
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
    at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:317)
    ... 10 more
2018-05-28 15:50:12,167 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at dfcdsrvbi0042.localdomain/161.148.242.43
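For anyone triaging a similar crash loop, the decisive lines are easy to miss in that wall of output. A minimal helper like the sketch below (the function name is invented for illustration) pulls out the ERROR/FATAL entries and the root-cause lines:

```shell
#!/bin/sh
# Sketch: extract the ERROR/FATAL entries and root-cause ("Caused by:")
# lines from a Hadoop daemon log such as the ZKFC log above.
# The function name is made up for illustration.
zkfc_errors() {
  # $1 = path to the log file
  grep -E '(ERROR|FATAL|Caused by:)' "$1"
}

# Example against a tiny sample; on the real node one would run:
#   zkfc_errors /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-dfcdsrvbi0042.localdomain.log
sample=$(mktemp)
cat > "$sample" <<'EOF'
2018-05-29 15:18:03,552 INFO ha.ActiveStandbyElector - Session connected.
2018-05-29 15:18:03,665 FATAL ha.ZKFailoverController - Fatal error occurred: Unexpected Zookeeper watch event state: AuthFailed
Caused by: org.apache.zookeeper.KeeperException$AuthFailedException
EOF
zkfc_errors "$sample"
rm -f "$sample"
```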
I even tried to reset the ZKFC data in ZooKeeper by issuing the command below, but the same error occurred:
[hdfs@dfcdsrvbi0042 ~]$ hdfs zkfc -formatZK -force
In the KDC log (/var/log/krb5kdc.log on the Kerberos server), I noted these entries:
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: PROCESS_TGS: authtime 0, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR, Ticket expired
Then I tried to renew the tickets manually by running the commands below (as the "hdfs" user on the same NameNode "dfcdsrvbi0042"):
kdestroy
kinit -V -kt /etc/security/keytabs/nn.service.keytab nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
kvno zookeeper/dfcdsrvbi0042.localdomain
kvno zookeeper/dfcdsrvbi0043.localdomain
kvno zookeeper/dfcdsrvbi0045.localdomain
klist
Here is the output of "klist":
[hdfs@dfcdsrvbi0042 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1031
Default principal: nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR

Valid starting       Expires              Service principal
05/29/2018 16:35:21  05/30/2018 16:35:21  krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR
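One detail in the earlier zkfc log deserves attention: the TGT reportedly expires in the year 2106, which is not a sane Kerberos lifetime and suggests a time-overflow problem somewhere on the KDC side. As a quick sanity check, something like the sketch below (a hypothetical helper, assuming klist's MM/DD/YYYY date columns and an arbitrary post-2038 cutoff) flags tickets with an implausibly distant expiry:

```shell
#!/bin/sh
# Sketch: flag klist output lines containing an MM/DD/YYYY date whose
# year lies implausibly far in the future (here, past 2038). The helper
# name and the 2038 cutoff are illustrative choices, not a standard.
suspicious_expiry() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]$/) {
        year = substr($i, 7, 4) + 0
        if (year > 2038) { print; next }
      }
  }'
}

# Example with fabricated klist-style lines; on the node one would pipe:
#   klist | suspicious_expiry
printf '%s\n' \
  '05/29/2018 16:35:21  05/30/2018 16:35:21  krbtgt/MYHDP.COM.BR@MYHDP.COM.BR' \
  '05/29/2018 16:35:21  02/07/2106 04:28:15  zookeeper/example@MYHDP.COM.BR' \
  | suspicious_expiry
```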
Those later commands were reflected in the KDC log:
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (8 etypes {18 17 16 23 20 19 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR
I also tried modifying "zookeeper_client_jaas.conf" (in /etc/zookeeper/2.6.0.3-8/0/ at the same "dfcdsrvbi0042" server) using these settings:
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=true
  keyTab="/etc/security/keytabs/zk.service.keytab"
  principal="zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR";
};
But nothing new happened.
I could manually connect to the ZooKeeper service and retrieve the contents of the "/hadoop-ha/MYHDPNNHA" znode that ZKFC was supposed to read:
[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/2.6.0.3-8/zookeeper/bin/zkCli.sh -server dfcdsrvbi0042.localdomain
[zk: dfcdsrvbi0042.localdomain(CONNECTED) 1] ls /
[hive, registry, zookeeper, infra-solr, hiveserver2-hive2, yarn-leader-election, hadoop-ha, services, llap-sasl, templeton-hadoop, zkdtsm_mp1_llap0, zkdtsm_hive_llap0, hiveserver2, rmstore, ambari-metrics-cluster, zkdtsm_04817003669_llap0]
[zk: dfcdsrvbi0042.localdomain(CONNECTED) 7] stat /hadoop-ha/MYHDPNNHA
cZxid = 0x200000007
ctime = Fri Apr 14 17:32:03 BRT 2017
mZxid = 0x200000007
mtime = Fri Apr 14 17:32:03 BRT 2017
pZxid = 0x2f0003c87a
cversion = 345
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 1
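Since zkCli works, the existence check that the ZKFC crash complains about can also be scripted. A throwaway helper like this (the name is invented here, and it assumes the bracketed listing format that zkCli prints for "ls /") verifies that a child such as hadoop-ha shows up:

```shell
#!/bin/sh
# Sketch: test whether a child znode name appears in the bracketed
# listing that zkCli.sh prints for "ls /". Helper name is illustrative.
znode_listed() {
  # $1 = zkCli "ls" output line, $2 = child name to look for
  printf '%s\n' "$1" | tr -d '[]' | tr ',' '\n' \
    | sed 's/^ *//; s/ *$//' | grep -qx "$2"
}

# Example with an abbreviated version of the listing above; in practice
# the line would come from the zkCli.sh session itself.
listing='[hive, registry, zookeeper, hadoop-ha, rmstore]'
if znode_listed "$listing" 'hadoop-ha'; then
  echo "hadoop-ha parent znode is present"
fi
```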
At this point, I have run out of ideas on how to restore my HDFS service.
Please, what else should I do?
Any help is appreciated.
Thanks in advance!
Created 05-30-2018 03:56 PM
Have you ever restarted your cluster after kerberization?
The assumption here is that your OS is RHEL/CentOS/Oracle Linux 6; if not, please adapt the commands.
Can you share the contents of your Kerberos configuration files (krb5.conf, kdc.conf, and kadm5.acl)?
Make sure the Kerberos daemons are up and running
# service kadmin status
# service krb5kdc status
If they are down:
# service kadmin start
# service krb5kdc start
Make sure NTPD is running so the time in your cluster is in sync.
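Kerberos tolerates only a small clock difference between client and KDC (300 seconds by default, the clockskew setting). That check can be expressed directly over epoch timestamps collected however you like; the function name below is made up for illustration:

```shell
#!/bin/sh
# Sketch: compare two epoch timestamps against the Kerberos clock-skew
# tolerance (300 s by default). Function name is illustrative.
check_skew() {
  # $1 = remote epoch seconds, $2 = local epoch seconds,
  # $3 = max allowed skew in seconds (defaults to 300)
  r=$1; l=$2; max=${3:-300}
  d=$((r - l))
  if [ "$d" -lt 0 ]; then d=$((-d)); fi
  if [ "$d" -gt "$max" ]; then
    echo "SKEW: ${d}s exceeds ${max}s"
    return 1
  fi
  echo "OK: ${d}s within ${max}s"
}

# In practice the remote value could come from e.g.: ssh <host> date +%s
check_skew "$(date +%s)" "$(date +%s)"
```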
To sum up, the ticket lifetime is the minimum of the following values: the ticket_lifetime requested by the client (e.g., from krb5.conf [libdefaults]), the realm's max_life in kdc.conf, the maximum ticket life of the requesting principal, and the maximum ticket life of the krbtgt/REALM principal.
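That minimum can be computed mechanically. A sketch (helper names invented; to_seconds handles only simple single-unit durations such as "24h" or "7d", while krb5 accepts richer syntax):

```shell
#!/bin/sh
# Sketch: compute an effective ticket lifetime as the minimum of several
# limits, given in krb5-style single-unit durations or plain seconds.
to_seconds() {
  case $1 in
    *d) echo $(( ${1%d} * 86400 )) ;;
    *h) echo $(( ${1%h} * 3600 )) ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *s) echo "${1%s}" ;;
    *)  echo "$1" ;;
  esac
}

min_lifetime() {
  min=$1; shift
  for v in "$@"; do
    if [ "$v" -lt "$min" ]; then min=$v; fi
  done
  echo "$min"
}

# Example numbers only: ticket_lifetime=24h, realm max_life=10h,
# principal and krbtgt max_life=7d -> effective lifetime is 10h.
min_lifetime "$(to_seconds 24h)" "$(to_seconds 10h)" \
             "$(to_seconds 7d)"  "$(to_seconds 7d)"
```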
Please revert
Created on 05-30-2018 06:49 PM - edited 08-17-2019 08:49 PM
> Have you ever restarted your cluster after kerberization?
Yes. I even restarted the operating system on all nodes.
> The assumption here is that your OS is RHEL/CentOS/Oracle Linux 6; if not, please adapt the commands
It's a CentOS Linux release 7.3.1611 (Core).
Here are the contents of Kerberos configuration files:
1) /etc/krb5.conf
[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = MYHDP.COM.BR
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[domain_realm]
  MYHDP.COM.BR = MYHDP.COM.BR

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  MYHDP.COM.BR = {
    admin_server = 10.30.242.46
    kdc = 10.30.242.46
  }
2) /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
  kdc_ports = 88
  kdc_tcp_ports = 88

[realms]
  MYHDP.COM.BR = {
    #master_key_type = aes256-cts
    acl_file = /var/kerberos/krb5kdc/kadm5.acl
    dict_file = /usr/share/dict/words
    admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
    supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
  }
3) /var/kerberos/krb5kdc/kadm5.acl
*/admin@MYHDP.COM.BR *
And here are the Kerberos services statuses:
4) kadmin
[root@dfcdsrvbi0045 ~]# service kadmin status
Redirecting to /bin/systemctl status kadmin.service
● kadmin.service - Kerberos 5 Password-changing and Administration
   Loaded: loaded (/usr/lib/systemd/system/kadmin.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
 Main PID: 1998 (kadmind)
   CGroup: /system.slice/kadmin.service
           └─1998 /usr/sbin/kadmind -P /var/run/kadmind.pid
5) krb5kdc
[root@dfcdsrvbi0045 ~]# service krb5kdc status
Redirecting to /bin/systemctl status krb5kdc.service
● krb5kdc.service - Kerberos 5 KDC
   Loaded: loaded (/usr/lib/systemd/system/krb5kdc.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
 Main PID: 1950 (krb5kdc)
   CGroup: /system.slice/krb5kdc.service
           └─1950 /usr/sbin/krb5kdc -P /var/run/krb5kdc.pid
> Make sure NTPD is running so the time in your cluster is in sync.
I've already manually synchronized the clocks of all cluster nodes against an NTP server using "ntpdate" (see output below).
Moreover, I haven't found any "clock skew too great" messages in the KDC log files.
[root@dfcdsrvbi0045 ~]# ./ntpdate-all.sh
Command: ntpdate -u dfcddom0002.otherdomain
[10.30.242.11] 30 May 14:59:46 ntpdate[46569]: adjust time server 10.30.14.150 offset 0.017012 sec
[10.30.242.12] 30 May 14:59:53 ntpdate[8593]: adjust time server 10.30.14.150 offset 0.012909 sec
[10.30.242.13] 30 May 14:59:59 ntpdate[64344]: adjust time server 10.30.14.151 offset 0.025877 sec
[10.30.242.22] 30 May 15:00:06 ntpdate[8161]: adjust time server 10.30.14.150 offset 0.001915 sec
[10.30.242.24] 30 May 15:00:12 ntpdate[79351]: adjust time server 10.30.14.150 offset 0.025769 sec
[10.30.242.25] 30 May 15:00:19 ntpdate[43070]: adjust time server 10.30.14.150 offset 0.011081 sec
[10.30.242.28] 30 May 15:00:26 ntpdate[76036]: adjust time server 10.30.14.150 offset 0.016353 sec
[10.30.242.43] 30 May 15:00:32 ntpdate[25763]: adjust time server 10.30.14.151 offset 0.007244 sec
[10.30.242.44] 30 May 15:00:39 ntpdate[16081]: adjust time server 10.30.14.151 offset -0.001653 sec
[10.30.242.45] 30 May 15:00:45 ntpdate[9834]: adjust time server 10.30.14.150 offset 0.012155 sec
[10.30.242.46] 30 May 15:00:51 ntpdate[10009]: adjust time server 10.30.14.151 offset 0.011018 sec
[10.30.242.50] 30 May 15:00:58 ntpdate[63572]: adjust time server 10.30.14.150 offset 0.005541 sec
[10.30.242.51] 30 May 15:01:05 ntpdate[24894]: adjust time server 10.30.14.151 offset -0.008577 sec
[10.30.242.53] 30 May 15:01:05 ntpdate[24813]: adjust time server 10.30.14.150 offset 0.024188 sec
The rest of the components of HDFS service (journal nodes, data nodes, and name nodes) seem functional, as illustrated in the picture below:
The problem is that both Name Nodes are stuck in "standby" mode and ZKFailoverControllers are offline.
Therefore, I couldn't start the other cluster services that depend on HDFS (e.g., YARN, Hive, etc.).
Any clue?
Created 05-30-2018 07:38 PM
The files look okay except for krb5.conf: notice the [domain_realm] entry. Please change it on the KDC, copy it to all the nodes, and restart the KDC services (kadmin and krb5kdc). Such a simple thing could be the problem:
[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = MYHDP.COM.BR
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[domain_realm]
  .myhdp.com.br = MYHDP.COM.BR
  myhdp.com.br = MYHDP.COM.BR

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  MYHDP.COM.BR = {
    admin_server = 10.30.242.46
    kdc = 10.30.242.46
  }
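After distributing the file, a quick scripted check that both mappings actually made it into each node's krb5.conf can save a round of restarts. A sketch (function name invented; assumes the usual "domain = REALM" line format under [domain_realm]):

```shell
#!/bin/sh
# Sketch: verify that krb5.conf maps the DNS domain to the realm under
# [domain_realm], both with and without a leading dot. Name illustrative.
check_domain_realm() {
  # $1 = path to krb5.conf, $2 = dns domain (lower case), $3 = realm
  awk -v dom="$2" -v realm="$3" '
    BEGIN { dotdom = "." dom }
    /^[ \t]*\[/ { in_dr = ($0 ~ /^[ \t]*\[domain_realm\]/) }
    in_dr && $1 == dotdom && $3 == realm { dot = 1 }
    in_dr && $1 == dom    && $3 == realm { bare = 1 }
    END { exit !(dot && bare) }
  ' "$1"
}

# Example against a temporary copy of the corrected settings:
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[domain_realm]
  .myhdp.com.br = MYHDP.COM.BR
  myhdp.com.br = MYHDP.COM.BR
EOF
if check_domain_realm "$tmp" myhdp.com.br MYHDP.COM.BR; then
  echo "domain_realm mappings present"
fi
rm -f "$tmp"
```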
What do your /etc/hosts entries look like? Make sure there are no mixed-case hostnames.
Why are Ranger, YARN, MR, and AMS disabled/in maintenance mode? Have you enabled the Ranger plugins?
Please revert
Created 10-05-2018 05:41 PM
This could be a KDC server issue. Please check whether you are hitting this: https://bugzilla.redhat.com/show_bug.cgi?id=1560951
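If that bug is in play, the fix would arrive via an updated krb5-server package. Comparing the installed version against a known-fixed one can gate a remediation script; the sketch below relies on GNU "sort -V", and the version string in the usage comment is purely a placeholder, not the actual fixed release:

```shell
#!/bin/sh
# Sketch: return success when version $1 sorts strictly before $2
# (rpm-style version-release strings; relies on GNU "sort -V").
version_lt() {
  if [ "$1" = "$2" ]; then return 1; fi
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Placeholder usage; "1.15.1-99" is NOT the real fixed release:
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' krb5-server)
#   version_lt "$installed" "1.15.1-99" && echo "consider updating krb5-server"
if version_lt "1.15.1-18" "1.15.1-19"; then
  echo "1.15.1-18 is older than 1.15.1-19"
fi
```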
Created 10-05-2018 10:52 PM
Any updates on this thread? It should have been closed long ago. Please take the time to accept a provided solution.
Created 02-28-2019 08:25 AM
Hi @Rodrigo Hjort ,
did you solve this problem and, if so, how?