Created 11-06-2017 11:39 AM
Hello,
After enabling Kerberos, the YARN ResourceManager failed to start. This is the content of the log file:
2017-11-06 12:11:58,708 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1232)) - Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /rmstore
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:593)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1008)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1049)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1045)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1085)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /rmstore
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$1.run(ZKRMStateStore.java:326)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$1.run(ZKRMStateStore.java:322)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createRootDir(ZKRMStateStore.java:336)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createRootDirRecursively(ZKRMStateStore.java:1311)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.startInternal(ZKRMStateStore.java:303)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStart(RMStateStore.java:598)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    ... 12 more
It seems to be an issue with ZooKeeper. If I run zkCli.sh on the node where the ResourceManager is installed, the prompt shows "AUTH_FAILED":
$ /usr/hdp/2.6.3.0-235/zookeeper/bin/zkCli.sh
Connecting to localhost:2181
log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Welcome to ZooKeeper!
JLine support is enabled
[zk: localhost:2181(CONNECTING) 0]
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
WATCHER::
WatchedEvent state:AuthFailed type:None path:null
[zk: localhost:2181(AUTH_FAILED) 0]
zkCli.sh on the other nodes works fine:
$ /usr/hdp/2.6.3.0-235/zookeeper/bin/zkCli.sh
Connecting to localhost:2181
log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Welcome to ZooKeeper!
JLine support is enabled
WATCHER::
WatchedEvent state:AuthFailed type:None path:null
[zk: localhost:2181(CONNECTING) 0]
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0]
Do you have any idea about how to troubleshoot this issue?
Many thanks in advance,
Jorge.
Created 11-06-2017 02:53 PM
Please check the value of "yarn.resourcemanager.zk-address" in the YARN configs (Yarn -> Config -> Advanced -> Fault tolerance).
If it is set to localhost:2181, change it to <zk-host>:2181 and restart the components. Additionally, perform a forward and reverse DNS lookup of the hostname where the RM is running.
Also, the AUTH_FAILED in zkCli.sh is due to the ZooKeeper address not being passed on the command line. If the -server option is not given, zkCli.sh assumes that ZooKeeper is running on the local host, which is not the case for you. You can try running
./zkCli.sh -server <zk-host>:2181
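The two checks above (no localhost entry in the ZooKeeper address, and matching forward/reverse DNS) could be scripted roughly as follows. This is only a sketch: the function names has_loopback_entry and check_dns are made up for illustration, the hostnames in the usage note are placeholders, and check_dns assumes getent is available on the node.

```shell
# Sketch only: helper names are hypothetical, not part of Hadoop or ZooKeeper.

# Succeeds (exit 0) if any host:port entry in a comma-separated quorum
# string (e.g. the value of yarn.resourcemanager.zk-address) points at
# the local loopback.
has_loopback_entry() {
  local quorum="$1" entry host
  local IFS=','
  for entry in $quorum; do
    host="${entry%%:*}"          # strip the :port suffix
    case "$host" in
      localhost|127.*) return 0 ;;
    esac
  done
  return 1
}

# Forward lookup of a name, then reverse lookup of the first address found.
# Assumes getent exists; results depend on the local resolver setup.
check_dns() {
  local name="$1" ip
  ip="$(getent hosts "$name" | awk '{print $1; exit}')"
  if [ -z "$ip" ]; then
    echo "forward lookup failed for $name"
    return 1
  fi
  echo "forward: $name -> $ip"
  echo "reverse: $ip -> $(getent hosts "$ip" | awk '{print $2; exit}')"
}
```

For example, has_loopback_entry "localhost:2181" succeeds, while has_loopback_entry "zk1.example.com:2181,zk2.example.com:2181" fails. On the RM node, check_dns "$(hostname -f)" should print the same name in both the forward and the reverse line.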
Thanks,
Aditya
Created 11-06-2017 03:20 PM
Hi Aditya,
I've checked the forward and reverse DNS lookups and found that "hostname -f" didn't return the FQDN. After fixing that, all services are up and running.
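For anyone hitting the same symptom, a quick check of what "hostname -f" returns might look like the sketch below. The helper names is_fqdn and check_local_fqdn are made up for illustration, and the test is purely syntactic (the name contains a dot and has no empty labels); it does not verify that the name resolves.

```shell
# Sketch only: helper names are hypothetical.

# Succeeds if the given name looks fully qualified: it contains at least
# one dot and has no empty labels (no leading, trailing, or double dots).
is_fqdn() {
  case "$1" in
    ""|.*|*.|*..*) return 1 ;;
    *.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reports whether the local "hostname -f" output looks like an FQDN.
check_local_fqdn() {
  local name
  name="$(hostname -f 2>/dev/null)"
  if is_fqdn "$name"; then
    echo "OK: $name looks fully qualified"
  else
    echo "WARN: '$name' is not an FQDN"
    return 1
  fi
}
```

Running check_local_fqdn on each node, the ResourceManager host in particular, shows whether any of them reports a short name.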
Thank you!
Jorge.