Created 05-30-2018 11:18 AM
After restarting a working HDP cluster through Ambari, both NameNodes came up in standby mode.
I also noticed that the ZKFailoverControllers failed to start.
Then, I performed these steps (through Ambari):
In order to start ZKFC manually, I tried the command below (as the "hdfs" user on the NameNode "dfcdsrvbi0042", which I elected to be the primary):
[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh start zkfc
It did not succeed, leaving the following in the log (/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-dfcdsrvbi0042.localdomain.log):
STARTUP_MSG: Starting DFSZKFailoverController
STARTUP_MSG: user = hdfs
STARTUP_MSG: host = dfcdsrvbi0042.localdomain/161.148.242.43
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.3.2.6.0.3-8
...
2018-05-29 15:18:02,724 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-05-29 15:18:03,148 INFO tools.DFSZKFailoverController (DFSZKFailoverController.java:<init>(153)) - Failover controller configured for NameNode NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,476 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(1101)) - Login successful for user nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR using keytab file /etc/security/keytabs/nn.service.keytab
2018-05-29 15:18:03,505 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=dfcdsrvbi0042.localdomain:2181,dfcdsrvbi0043.localdomain:2181,dfcdsrvbi0045.localdomain:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@76b1e9b8
2018-05-29 15:18:03,527 INFO zookeeper.Login (Login.java:login(294)) - successfully logged in.
2018-05-29 15:18:03,529 INFO zookeeper.Login (Login.java:run(127)) - TGT refresh thread started.
2018-05-29 15:18:03,531 INFO client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL mechanism.
2018-05-29 15:18:03,537 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server dfcdsrvbi0045.localdomain/161.148.242.46:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2018-05-29 15:18:03,538 INFO zookeeper.Login (Login.java:getRefreshTime(302)) - TGT valid starting at: Tue May 29 15:18:03 BRT 2018
2018-05-29 15:18:03,538 INFO zookeeper.Login (Login.java:getRefreshTime(303)) - TGT expires: Sun Feb 07 04:28:15 BRST 2106
2018-05-29 15:18:03,539 INFO zookeeper.Login (Login.java:run(181)) - TGT refresh sleeping until: Fri Jan 25 11:27:36 BRST 2092
2018-05-29 15:18:03,543 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to dfcdsrvbi0045.localdomain/161.148.242.46:2181, initiating session
2018-05-29 15:18:03,550 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server dfcdsrvbi0045.localdomain/161.148.242.46:2181, sessionid = 0x363acc9fc850003, negotiated timeout = 6000
2018-05-29 15:18:03,552 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(600)) - Session connected.
2018-05-29 15:18:03,559 ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(388)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,559 ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1059)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,665 FATAL ha.ActiveStandbyElector (ActiveStandbyElector.java:fatalError(695)) - Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK connection for elector id=2075809815 appData=null cb=Elector callbacks for NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,665 FATAL ha.ZKFailoverController (ZKFailoverController.java:fatalError(374)) - Fatal error occurred:Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(832)) - terminateConnection, zkConnectionState = TERMINATED
2018-05-29 15:18:03,666 FATAL tools.DFSZKFailoverController (DFSZKFailoverController.java:main(193)) - Got a fatal error, exiting now
java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/MYHDPNNHA'
at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:319)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:218)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1846)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:474)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:191)
Caused by: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hadoop-ha/MYHDPNNHA
at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:317)
... 10 more
2018-05-28 15:50:12,167 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at dfcdsrvbi0042.localdomain/161.148.242.43
I even tried to reset the corresponding ZKFC data in ZooKeeper with the command below, but the same error occurred:
[hdfs@dfcdsrvbi0042 ~]$ hdfs zkfc -formatZK -force
In the KDC log (/var/log/krb5kdc.log on the Kerberos server), I noticed these entries:
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: PROCESS_TGS: authtime 0, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR, Ticket expired
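As a sanity check, the authtime in these entries lines up with the wall-clock timestamp of the log itself, so KDC clock skew did not look like the culprit (a quick conversion; 1527622154 is copied from the entries above, and BRT is UTC-3):

```shell
# authtime copied from the TGS_REQ entry above
date -u -d @1527622154 '+%Y-%m-%d %H:%M:%S'
# -> 2018-05-29 19:29:14 (UTC), i.e. 16:29:14 BRT (UTC-3),
#    which matches the "May 29 16:29:14" stamp on the log line itself
```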
Then I tried to manually renew the tickets by executing the following (as the "hdfs" user on the same NameNode, "dfcdsrvbi0042"):
kdestroy
kinit -V -kt /etc/security/keytabs/nn.service.keytab nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
kvno zookeeper/dfcdsrvbi0042.localdomain
kvno zookeeper/dfcdsrvbi0043.localdomain
kvno zookeeper/dfcdsrvbi0045.localdomain
klist
Here were the results of "klist":
[hdfs@dfcdsrvbi0042 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1031
Default principal: nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
Valid starting       Expires              Service principal
05/29/2018 16:35:21  05/30/2018 16:35:21  krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21  05/30/2018 16:35:21  zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR
Those later commands were reflected in the KDC log:
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (8 etypes {18 17 16 23 20 19 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR
I also tried modifying "zookeeper_client_jaas.conf" (in /etc/zookeeper/2.6.0.3-8/0/ at the same "dfcdsrvbi0042" server) using these settings:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=true
keyTab="/etc/security/keytabs/zk.service.keytab"
principal="zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR";
};
But nothing new happened.
I could manually connect to the ZooKeeper service and read the "/hadoop-ha/MYHDPNNHA" znode that ZKFC was supposed to access:
[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/2.6.0.3-8/zookeeper/bin/zkCli.sh -server dfcdsrvbi0042.localdomain
[zk: dfcdsrvbi0042.localdomain(CONNECTED) 1] ls /
[hive, registry, zookeeper, infra-solr, hiveserver2-hive2, yarn-leader-election, hadoop-ha, services, llap-sasl, templeton-hadoop, zkdtsm_mp1_llap0, zkdtsm_hive_llap0, hiveserver2, rmstore, ambari-metrics-cluster, zkdtsm_04817003669_llap0]
[zk: dfcdsrvbi0042.localdomain(CONNECTED) 7] stat /hadoop-ha/MYHDPNNHA
cZxid = 0x200000007
ctime = Fri Apr 14 17:32:03 BRT 2017
mZxid = 0x200000007
mtime = Fri Apr 14 17:32:03 BRT 2017
pZxid = 0x2f0003c87a
cversion = 345
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 1
Therefore, I have run out of ideas on how to restore my HDFS.
Please, what else should I do?
Any help is appreciated.
Thanks in advance!
Created 05-30-2018 03:56 PM
Have you ever restarted your cluster after kerberization?
The assumption here is that your OS is RHEL/CentOS/Oracle Linux 6; if not, please adapt the commands accordingly.
Can you share the contents of your Kerberos configuration files (krb5.conf, kdc.conf, and kadm5.acl)?
Make sure the Kerberos daemons are up and running:
# service kadmin status
# service krb5kdc status
If they are down, start them:
# service kadmin start
# service krb5kdc start
Make sure NTPD is running so the time in your cluster is in sync.
To sum up, the ticket lifetime is the minimum of the following values:
- the lifetime requested by the client (e.g. kinit -l)
- the ticket_lifetime setting in the [libdefaults] section of krb5.conf
- the max_life attribute of the client principal and of the krbtgt/REALM service principal
- the max_life setting in the realm stanza of kdc.conf
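The KDC-side maximums live in kdc.conf; a realm stanza with explicit limits (the values below are common defaults, shown only as an illustration) would look like:

```
[realms]
    MYHDP.COM.BR = {
        max_life = 24h 0m 0s
        max_renewable_life = 7d 0h 0m 0s
    }
```

Note that each principal also carries its own maxlife/maxrenewlife attributes, adjustable with kadmin's modprinc.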
Please revert
Created on 05-30-2018 06:49 PM - edited 08-17-2019 08:49 PM
> Have you ever restarted your cluster after kerberization?
Yes. I even restarted the operating system at all nodes.
> The assumption here is your OS is RHEL/CentOS/Oracle Linux 6,if not please adapt the commands
It's a CentOS Linux release 7.3.1611 (Core).
Here are the contents of Kerberos configuration files:
1) /etc/krb5.conf
[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = MYHDP.COM.BR
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
default_ccache_name = /tmp/krb5cc_%{uid}
#default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
#default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
[domain_realm]
MYHDP.COM.BR = MYHDP.COM.BR
[logging]
default = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
kdc = FILE:/var/log/krb5kdc.log
[realms]
MYHDP.COM.BR = {
admin_server = 10.30.242.46
kdc = 10.30.242.46
}
2) /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
MYHDP.COM.BR = {
#master_key_type = aes256-cts
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}
3) /var/kerberos/krb5kdc/kadm5.acl
*/admin@MYHDP.COM.BR *
And here are the Kerberos services statuses:
4) kadmin
[root@dfcdsrvbi0045 ~]# service kadmin status
Redirecting to /bin/systemctl status kadmin.service
● kadmin.service - Kerberos 5 Password-changing and Administration
Loaded: loaded (/usr/lib/systemd/system/kadmin.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
Main PID: 1998 (kadmind)
CGroup: /system.slice/kadmin.service
└─1998 /usr/sbin/kadmind -P /var/run/kadmind.pid
5) krb5kdc
[root@dfcdsrvbi0045 ~]# service krb5kdc status
Redirecting to /bin/systemctl status krb5kdc.service
● krb5kdc.service - Kerberos 5 KDC
Loaded: loaded (/usr/lib/systemd/system/krb5kdc.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
Main PID: 1950 (krb5kdc)
CGroup: /system.slice/krb5kdc.service
└─1950 /usr/sbin/krb5kdc -P /var/run/krb5kdc.pid
> Make sure NTPD is running so the time in your cluster is in sync.
I've already synced the clocks of all cluster nodes against an NTP server using "ntpdate" (see output below).
Moreover, I haven't found any "clock skew too great" messages in the KDC log files.
[root@dfcdsrvbi0045 ~]# ./ntpdate-all.sh
Command: ntpdate -u dfcddom0002.otherdomain
[10.30.242.11] 30 May 14:59:46 ntpdate[46569]: adjust time server 10.30.14.150 offset 0.017012 sec
[10.30.242.12] 30 May 14:59:53 ntpdate[8593]: adjust time server 10.30.14.150 offset 0.012909 sec
[10.30.242.13] 30 May 14:59:59 ntpdate[64344]: adjust time server 10.30.14.151 offset 0.025877 sec
[10.30.242.22] 30 May 15:00:06 ntpdate[8161]: adjust time server 10.30.14.150 offset 0.001915 sec
[10.30.242.24] 30 May 15:00:12 ntpdate[79351]: adjust time server 10.30.14.150 offset 0.025769 sec
[10.30.242.25] 30 May 15:00:19 ntpdate[43070]: adjust time server 10.30.14.150 offset 0.011081 sec
[10.30.242.28] 30 May 15:00:26 ntpdate[76036]: adjust time server 10.30.14.150 offset 0.016353 sec
[10.30.242.43] 30 May 15:00:32 ntpdate[25763]: adjust time server 10.30.14.151 offset 0.007244 sec
[10.30.242.44] 30 May 15:00:39 ntpdate[16081]: adjust time server 10.30.14.151 offset -0.001653 sec
[10.30.242.45] 30 May 15:00:45 ntpdate[9834]: adjust time server 10.30.14.150 offset 0.012155 sec
[10.30.242.46] 30 May 15:00:51 ntpdate[10009]: adjust time server 10.30.14.151 offset 0.011018 sec
[10.30.242.50] 30 May 15:00:58 ntpdate[63572]: adjust time server 10.30.14.150 offset 0.005541 sec
[10.30.242.51] 30 May 15:01:05 ntpdate[24894]: adjust time server 10.30.14.151 offset -0.008577 sec
[10.30.242.53] 30 May 15:01:05 ntpdate[24813]: adjust time server 10.30.14.150 offset 0.024188 sec
The rest of the HDFS service components (JournalNodes, DataNodes, and NameNodes) seem functional, as illustrated in the picture below:
The problem is that both NameNodes are stuck in "standby" mode and the ZKFailoverControllers are offline.
Therefore, I couldn't start the other cluster services that depend on HDFS (YARN, Hive, etc.).
Any clue?
Created 05-30-2018 07:38 PM
The files look okay except krb5.conf: notice the [domain_realm] entry. Please change it on the KDC, copy it to all the nodes, and restart the KDC services (kadmin and krb5kdc). Such a simple thing could be the problem:
[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = MYHDP.COM.BR
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
default_ccache_name = /tmp/krb5cc_%{uid}
#default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
#default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
[domain_realm]
.myhdp.com.br = MYHDP.COM.BR
myhdp.com.br = MYHDP.COM.BR
[logging]
default = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
kdc = FILE:/var/log/krb5kdc.log
[realms]
MYHDP.COM.BR = {
admin_server = 10.30.242.46
kdc = 10.30.242.46
}
How do your /etc/hosts entries look? Make sure there are no mixed-case hostnames.
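For reference, /etc/hosts entries along these lines are what the Hadoop/Kerberos stack expects — all lowercase, fully qualified name first (the IPs and hostnames below are taken from the logs earlier in this thread; the remaining nodes would follow the same pattern):

```
161.148.242.43   dfcdsrvbi0042.localdomain   dfcdsrvbi0042
161.148.242.46   dfcdsrvbi0045.localdomain   dfcdsrvbi0045
```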
Why are Ranger, YARN, MR, and AMS disabled/in maintenance mode? Have you enabled the Ranger plugins?
Please revert
Created 10-05-2018 05:41 PM
This could be a KDC server issue. Please check whether you are hitting this: https://bugzilla.redhat.com/show_bug.cgi?id=1560951
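One detail in the ZKFC log above points the same way: the TGT expiry was reported as "Sun Feb 07 04:28:15 BRST 2106", which is exactly 2^32 - 1 seconds after the Unix epoch — a telltale sign of a 32-bit overflow in ticket-lifetime handling on the KDC side (BRST is UTC-2):

```shell
# The maximum 32-bit unsigned epoch timestamp, in UTC
date -u -d @4294967295 '+%Y-%m-%d %H:%M:%S'
# -> 2106-02-07 06:28:15 (UTC), i.e. 04:28:15 BRST (UTC-2),
#    matching the "TGT expires" line in the ZKFC log
```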
Created 10-05-2018 10:52 PM
Any updates on this thread? It should have been closed long ago. Please take the time to accept a provided solution.
Created 02-28-2019 08:25 AM
Hi @Rodrigo Hjort ,
Did you solve this problem, and if so, how?