YARN ResourceManager service failed to start after enabling Kerberos

Contributor

Hello,

After enabling Kerberos, the YARN ResourceManager failed to start. This is the content of the log file:

2017-11-06 12:11:58,708 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1232)) - Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /rmstore
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:593)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1008)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1049)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1045)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1085)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /rmstore
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$1.run(ZKRMStateStore.java:326)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$1.run(ZKRMStateStore.java:322)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createRootDir(ZKRMStateStore.java:336)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createRootDirRecursively(ZKRMStateStore.java:1311)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.startInternal(ZKRMStateStore.java:303)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStart(RMStateStore.java:598)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        ... 12 more


It seems to be an issue with ZooKeeper. If I execute zkCli.sh on the node where the ResourceManager is installed, the state "AUTH_FAILED" is displayed:

$ /usr/hdp/2.6.3.0-235/zookeeper/bin/zkCli.sh
Connecting to localhost:2181
log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Welcome to ZooKeeper!
JLine support is enabled
[zk: localhost:2181(CONNECTING) 0]
WATCHER::


WatchedEvent state:SyncConnected type:None path:null


WATCHER::


WatchedEvent state:AuthFailed type:None path:null


[zk: localhost:2181(AUTH_FAILED) 0]


zkCli.sh on the other nodes works fine:

$ /usr/hdp/2.6.3.0-235/zookeeper/bin/zkCli.sh
Connecting to localhost:2181
log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Welcome to ZooKeeper!
JLine support is enabled


WATCHER::


WatchedEvent state:AuthFailed type:None path:null
[zk: localhost:2181(CONNECTING) 0]
WATCHER::


WatchedEvent state:SyncConnected type:None path:null


[zk: localhost:2181(CONNECTED) 0]

Do you have any idea about how to troubleshoot this issue?

Many thanks in advance,

Jorge.

1 ACCEPTED SOLUTION

Super Guru

@Jorge Florencio,

Please check the value of "yarn.resourcemanager.zk-address" in the YARN configs (Yarn -> Config -> Advanced -> Fault tolerance).
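
The same value can also be read from the command line on the ResourceManager node. The following is only a minimal sketch: the helper name get_hadoop_property is hypothetical, the path /etc/hadoop/conf/yarn-site.xml is an assumption about where the cluster keeps its client configuration, and the helper assumes each <name> element is followed by its <value> on the next line.

```shell
# Hypothetical helper: print the <value> of one property from a
# Hadoop-style XML configuration file.
# Assumes <name> and <value> are on adjacent lines (or on the same line).
get_hadoop_property() {
  local file="$1" name="$2"
  grep -A1 "<name>${name}</name>" "$file" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p' \
    | head -n 1
}

# Example call (assumed path; adjust to your cluster):
# get_hadoop_property /etc/hadoop/conf/yarn-site.xml yarn.resourcemanager.zk-address
```

If the call prints localhost:2181, the ResourceManager is configured to talk to a ZooKeeper server on its own host.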

If it is set to localhost:2181, change it to <zk-host>:2181 and try restarting the components. Additionally, perform a forward and reverse DNS lookup of the hostname where the RM is running.
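
A forward and reverse lookup can be scripted roughly as follows. This is a sketch under assumptions: check_dns_roundtrip is a hypothetical helper name, it relies on getent being available (as on most Linux distributions), and it expects the fully qualified name as its argument, so a short hostname will usually be reported as a mismatch.

```shell
# Hypothetical helper: resolve a hostname to an IP (forward lookup),
# then resolve that IP back to a name (reverse lookup) and compare.
check_dns_roundtrip() {
  local host="$1" ip back
  # First field of the getent output is the resolved address.
  ip=$(getent hosts "$host" | awk '{print $1; exit}')
  if [ -z "$ip" ]; then
    echo "forward lookup failed for $host"
    return 1
  fi
  # Second field of the getent output is the canonical name.
  back=$(getent hosts "$ip" | awk '{print $2; exit}')
  echo "$host -> $ip -> ${back:-none}"
  # Succeeds only if the reverse lookup returns the name we started with.
  [ "$back" = "$host" ]
}

# Example call, using the FQDN of the ResourceManager host:
# check_dns_roundtrip "$(hostname -f)"
```

A non-zero exit status means either the name did not resolve or the reverse record points to a different name. Both cases are worth fixing before restarting the services.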

Also, the AUTH_FAILED in zkCli.sh is due to the ZooKeeper address not being passed in the command. If the -server option is not passed, zkCli.sh assumes that ZooKeeper is running on localhost, which is not the case for you. You can try running:

./zkCli.sh -server <zk-host>:2181 

Thanks,

Aditya

2 REPLIES

Contributor

Hi Aditya,

I've checked the DNS, forward and reverse, and I've seen that "hostname -f" doesn't display the FQDN. After fixing this, all services are up and running.
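
For anyone hitting the same problem, a quick check along these lines shows whether "hostname -f" returns a fully qualified name. It is only a heuristic sketch: is_fqdn is a hypothetical helper that treats any name containing a dot as fully qualified, so a name such as localhost.localdomain would still pass.

```shell
# Hypothetical helper: succeed if the given name contains at least one dot.
is_fqdn() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Fall back to an empty string if hostname is unavailable or fails.
name=$(hostname -f 2>/dev/null || true)
if is_fqdn "$name"; then
  echo "OK: hostname -f returns $name"
else
  echo "WARNING: hostname -f returns '$name', which is not fully qualified"
fi
```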

Thank you!

Jorge.