05-30-2018 06:49 PM
@Geoffrey Shelton Okot

> Have you ever restarted your cluster after kerberization?

Yes. I even restarted the operating system on all nodes.

> The assumption here is your OS is RHEL/CentOS/Oracle Linux 6, if not please adapt the commands

It's CentOS Linux release 7.3.1611 (Core). Here are the contents of the Kerberos configuration files:

1) /etc/krb5.conf

[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = MYHDP.COM.BR
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
default_ccache_name = /tmp/krb5cc_%{uid}
#default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
#default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
[domain_realm]
MYHDP.COM.BR = MYHDP.COM.BR
[logging]
default = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
kdc = FILE:/var/log/krb5kdc.log
[realms]
MYHDP.COM.BR = {
admin_server = 10.30.242.46
kdc = 10.30.242.46
}

2) /var/kerberos/krb5kdc/kdc.conf

[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
MYHDP.COM.BR = {
#master_key_type = aes256-cts
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}

3) /var/kerberos/krb5kdc/kadm5.acl

*/admin@MYHDP.COM.BR *

And here are the statuses of the Kerberos services:

4) kadmin

[root@dfcdsrvbi0045 ~]# service kadmin status
Redirecting to /bin/systemctl status kadmin.service
● kadmin.service - Kerberos 5 Password-changing and Administration
Loaded: loaded (/usr/lib/systemd/system/kadmin.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
Main PID: 1998 (kadmind)
CGroup: /system.slice/kadmin.service
└─1998 /usr/sbin/kadmind -P /var/run/kadmind.pid

5) krb5kdc

[root@dfcdsrvbi0045 ~]# service krb5kdc status
Redirecting to /bin/systemctl status krb5kdc.service
● krb5kdc.service - Kerberos 5 KDC
Loaded: loaded (/usr/lib/systemd/system/krb5kdc.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2018-05-28 10:19:07 -03; 2 days ago
Main PID: 1950 (krb5kdc)
CGroup: /system.slice/krb5kdc.service
└─1950 /usr/sbin/krb5kdc -P /var/run/krb5kdc.pid

> Make sure NTPD is running so the time in your cluster is in sync.

I've already synchronized the clocks of all cluster nodes against an NTP server using "ntpdate" (see output below). Moreover, I haven't found any "clock skew too great" messages in the KDC log files.

[root@dfcdsrvbi0045 ~]# ./ntpdate-all.sh
Command: ntpdate -u dfcddom0002.otherdomain
[10.30.242.11]
30 May 14:59:46 ntpdate[46569]: adjust time server 10.30.14.150 offset 0.017012 sec
[10.30.242.12]
30 May 14:59:53 ntpdate[8593]: adjust time server 10.30.14.150 offset 0.012909 sec
[10.30.242.13]
30 May 14:59:59 ntpdate[64344]: adjust time server 10.30.14.151 offset 0.025877 sec
[10.30.242.22]
30 May 15:00:06 ntpdate[8161]: adjust time server 10.30.14.150 offset 0.001915 sec
[10.30.242.24]
30 May 15:00:12 ntpdate[79351]: adjust time server 10.30.14.150 offset 0.025769 sec
[10.30.242.25]
30 May 15:00:19 ntpdate[43070]: adjust time server 10.30.14.150 offset 0.011081 sec
[10.30.242.28]
30 May 15:00:26 ntpdate[76036]: adjust time server 10.30.14.150 offset 0.016353 sec
[10.30.242.43]
30 May 15:00:32 ntpdate[25763]: adjust time server 10.30.14.151 offset 0.007244 sec
[10.30.242.44]
30 May 15:00:39 ntpdate[16081]: adjust time server 10.30.14.151 offset -0.001653 sec
[10.30.242.45]
30 May 15:00:45 ntpdate[9834]: adjust time server 10.30.14.150 offset 0.012155 sec
[10.30.242.46]
30 May 15:00:51 ntpdate[10009]: adjust time server 10.30.14.151 offset 0.011018 sec
[10.30.242.50]
30 May 15:00:58 ntpdate[63572]: adjust time server 10.30.14.150 offset 0.005541 sec
[10.30.242.51]
30 May 15:01:05 ntpdate[24894]: adjust time server 10.30.14.151 offset -0.008577 sec
[10.30.242.53]
30 May 15:01:05 ntpdate[24813]: adjust time server 10.30.14.150 offset 0.024188 sec

The rest of the HDFS components (JournalNodes, DataNodes, and NameNodes) seem functional, as illustrated in the attached screenshot. The problem is that both NameNodes are stuck in "standby" mode and the ZKFailoverControllers are offline. Therefore, I can't start the other cluster services that depend on HDFS (YARN, Hive, etc.). Any clue?
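For reference, here is roughly how I would check the HA state from the command line as well (just a sketch; "nn1" and "nn2" below are placeholders for whatever NameNode IDs dfs.ha.namenodes.MYHDPNNHA actually defines in my hdfs-site.xml, and I kinit first with the NameNode keytab because the cluster is Kerberized):

[hdfs@dfcdsrvbi0042 ~]$ kinit -kt /etc/security/keytabs/nn.service.keytab nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
[hdfs@dfcdsrvbi0042 ~]$ hdfs haadmin -getServiceState nn1
[hdfs@dfcdsrvbi0042 ~]$ hdfs haadmin -getServiceState nn2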
05-30-2018 11:18 AM
After I restarted a working HDP cluster through Ambari, both NameNodes came up in standby mode. I also noticed that the ZKFailoverControllers failed to come up. Then I performed these steps (through Ambari):

1) stopped all services
2) started only the ZooKeeper service (3 servers: dfcdsrvbi0042, dfcdsrvbi0043, dfcdsrvbi0045)
3) started all JournalNodes (3 servers: dfcdsrvbi0042, dfcdsrvbi0043, dfcdsrvbi0045)

To start ZKFC manually, I tried the command below (as the "hdfs" user on the NameNode "dfcdsrvbi0042", which I elected to be the primary):

[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh start zkfc

It had no success, producing the following entries in the log (/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-dfcdsrvbi0042.localdomain.log):

STARTUP_MSG: Starting DFSZKFailoverController
STARTUP_MSG: user = hdfs
STARTUP_MSG: host = dfcdsrvbi0042.localdomain/161.148.242.43
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.3.2.6.0.3-8
...
2018-05-29 15:18:02,724 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-05-29 15:18:03,148 INFO tools.DFSZKFailoverController (DFSZKFailoverController.java:<init>(153)) - Failover controller configured for NameNode NameNode at dfcdsrv
bi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,476 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(1101)) - Login successful for user nn/dfcdsrvbi0042.localdoma
in@MYHDP.COM.BR using keytab file /etc/security/keytabs/nn.service.keytab
2018-05-29 15:18:03,505 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=dfcdsrvbi0042.localdomain:2181,dfcdsrvbi004
3.localdomain:2181,dfcdsrvbi0045.localdomain:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@76b1e9b8
2018-05-29 15:18:03,527 INFO zookeeper.Login (Login.java:login(294)) - successfully logged in.
2018-05-29 15:18:03,529 INFO zookeeper.Login (Login.java:run(127)) - TGT refresh thread started.
2018-05-29 15:18:03,531 INFO client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL mechanism.
2018-05-29 15:18:03,537 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server dfcdsrvbi0045.localdomain/161.148.242.4
6:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2018-05-29 15:18:03,538 INFO zookeeper.Login (Login.java:getRefreshTime(302)) - TGT valid starting at: Tue May 29 15:18:03 BRT 2018
2018-05-29 15:18:03,538 INFO zookeeper.Login (Login.java:getRefreshTime(303)) - TGT expires: Sun Feb 07 04:28:15 BRST 2106
2018-05-29 15:18:03,539 INFO zookeeper.Login (Login.java:run(181)) - TGT refresh sleeping until: Fri Jan 25 11:27:36 BRST 2092
2018-05-29 15:18:03,543 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established to dfcdsrvbi0045.localdomain/161.148.242.46:2181, initiating session
2018-05-29 15:18:03,550 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server dfcdsrvbi0045.localdomain/161.148.242.46:2181, sessionid = 0x363acc9fc850003, negotiated timeout = 6000
2018-05-29 15:18:03,552 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(600)) - Session connected.
2018-05-29 15:18:03,559 ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(388)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,559 ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1059)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2018-05-29 15:18:03,665 FATAL ha.ActiveStandbyElector (ActiveStandbyElector.java:fatalError(695)) - Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK connection for elector id=2075809815 appData=null cb=Elector callbacks for NameNode at dfcdsrvbi0042.localdomain/161.148.242.43:8020
2018-05-29 15:18:03,665 FATAL ha.ZKFailoverController (ZKFailoverController.java:fatalError(374)) - Fatal error occurred:Unexpected Zookeeper watch event state: AuthFailed
2018-05-29 15:18:03,665 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(832)) - terminateConnection, zkConnectionState = TERMINATED
2018-05-29 15:18:03,666 FATAL tools.DFSZKFailoverController (DFSZKFailoverController.java:main(193)) - Got a fatal error, exiting now
java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/MYHDPNNHA'
at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:319)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:218)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1846)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:474)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:191)
Caused by: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hadoop-ha/MYHDPNNHA
at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:317)
... 10 more
2018-05-28 15:50:12,167 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at dfcdsrvbi0042.localdomain/161.148.242.43
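For completeness, this is how the principals and key version numbers inside the NameNode keytab used in the login above can be inspected (just a quick sanity-check sketch, using the keytab path from the log):

[hdfs@dfcdsrvbi0042 ~]$ klist -kt /etc/security/keytabs/nn.service.keytab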
I even tried to reset the corresponding ZKFC data by issuing the command below, but the same error occurred:

[hdfs@dfcdsrvbi0042 ~]$ hdfs zkfc -formatZK -force

In the KDC log (/var/log/krb5kdc.log on the Kerberos server), I noted these entries:

May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: ISSUE: authtime 1527622154, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:29:14 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (4 etypes {18 17 16 23}) 161.148.242.43: PROCESS_TGS: authtime 0, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR, Ticket expired
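To help interpret the "Ticket expired" entry above, I suppose the lifetime attributes of the principals involved could be inspected on the KDC host, roughly like this (a sketch; "getprinc" prints fields such as "Expiration date" and "Maximum ticket life"):

[root@dfcdsrvbi0045 ~]# kadmin.local -q "getprinc krbtgt/MYHDP.COM.BR@MYHDP.COM.BR"
[root@dfcdsrvbi0045 ~]# kadmin.local -q "getprinc nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR"
[root@dfcdsrvbi0045 ~]# kadmin.local -q "getprinc zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR"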
Then I tried to manually renew the tickets by executing the following (as the "hdfs" user on the same NameNode, "dfcdsrvbi0042"):

kdestroy
kinit -V -kt /etc/security/keytabs/nn.service.keytab nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
kvno zookeeper/dfcdsrvbi0042.localdomain
kvno zookeeper/dfcdsrvbi0043.localdomain
kvno zookeeper/dfcdsrvbi0045.localdomain
klist

Here are the results of "klist":

[hdfs@dfcdsrvbi0042 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1031
Default principal: nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
Valid starting Expires Service principal
05/29/2018 16:35:21 05/30/2018 16:35:21 krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
05/29/2018 16:35:21 05/30/2018 16:35:21 zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21 05/30/2018 16:35:21 zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
05/29/2018 16:35:21 05/30/2018 16:35:21 zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR

Those commands were reflected in the KDC log:

May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): AS_REQ (8 etypes {18 17 16 23 20 19 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for krbtgt/MYHDP.COM.BR@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0043.localdomain@MYHDP.COM.BR
May 29 16:36:11 dfcdsrvbi0045.localdomain krb5kdc[1950](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 161.148.242.43: ISSUE: authtime 1527622571, etypes {rep=18 tkt=18 ses=18}, nn/dfcdsrvbi0042.localdomain@MYHDP.COM.BR for zookeeper/dfcdsrvbi0045.localdomain@MYHDP.COM.BR

I also tried modifying "zookeeper_client_jaas.conf" (in /etc/zookeeper/2.6.0.3-8/0/ on the same server, "dfcdsrvbi0042") with these settings:

Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=true
keyTab="/etc/security/keytabs/zk.service.keytab"
principal="zookeeper/dfcdsrvbi0042.localdomain@MYHDP.COM.BR";
};

But nothing changed. I could manually connect to the ZooKeeper service and retrieve the contents of the "/hadoop-ha/MYHDPNNHA" znode that ZKFC was supposed to read:

[hdfs@dfcdsrvbi0042 ~]$ /usr/hdp/2.6.0.3-8/zookeeper/bin/zkCli.sh -server dfcdsrvbi0042.localdomain
[zk: dfcdsrvbi0042.localdomain(CONNECTED) 1] ls /
[hive, registry, zookeeper, infra-solr, hiveserver2-hive2, yarn-leader-election, hadoop-ha, services, llap-sasl, templeton-hadoop, zkdtsm_mp1_llap0, zkdtsm_hive_llap0, hiveserver2, rmstore, ambari-metrics-cluster, zkdtsm_04817003669_llap0]
[zk: dfcdsrvbi0042.localdomain(CONNECTED) 7] stat /hadoop-ha/MYHDPNNHA
cZxid = 0x200000007
ctime = Fri Apr 14 17:32:03 BRT 2017
mZxid = 0x200000007
mtime = Fri Apr 14 17:32:03 BRT 2017
pZxid = 0x2f0003c87a
cversion = 345
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 1

Therefore, I have run out of ideas on what to try to restore my HDFS. What else should I do? Any help is appreciated. Thanks in advance!
Labels: Apache Hadoop