ResourceManager crashes due to KeeperErrorCode = ConnectionLoss

Hi,

Recently we have been experiencing RM crashes, and we see the following error in the log:

Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)

 

 

We also get a lot of these exceptions in the Resource Manager log:

java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-04-30 22:53:51,669 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot/RMDelegationToken_57967
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:999)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:996)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:996)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeRMDelegationTokenState(ZKRMStateStore.java:737)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.removeRMDelegationToken(RMStateStore.java:668)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:142)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:49)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.removeExpiredToken(AbstractDelegationTokenSecretManager.java:605)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.access$400(AbstractDelegationTokenSecretManager.java:54)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:656)
        at java.lang.Thread.run(Thread.java:745)
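
For triage, the ConnectionLoss entries in a log like the excerpts above can be tallied per znode path to see how often they occur and which paths are affected. This is only a minimal sketch, assuming the line format shown in the excerpts. The helper name `count_connection_loss` and the `<no path>` label are illustrative and are not part of Hadoop or ZooKeeper.

```python
import re
from collections import Counter

# Matches lines such as:
#   ...ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/...
# The " for <path>" suffix is optional: in the excerpts above, the failure
# raised from ZooKeeper.multi() carries no path.
PATTERN = re.compile(r"KeeperErrorCode = ConnectionLoss(?: for (\S+))?")


def count_connection_loss(lines):
    """Count ConnectionLoss errors per znode path (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # Group 1 is None when the log line has no " for <path>" suffix.
            counts[match.group(1) or "<no path>"] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the excerpts above, used as sample input.
    sample = [
        "org.apache.zookeeper.KeeperException$ConnectionLossException: "
        "KeeperErrorCode = ConnectionLoss",
        "org.apache.zookeeper.KeeperException$ConnectionLossException: "
        "KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/"
        "RMDTSecretManagerRoot/RMDelegationTokensRoot/RMDelegationToken_57967",
    ]
    for path, n in count_connection_loss(sample).most_common():
        print(n, path)
```

To run it against a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.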

 

The network is fine in terms of errors and packet drops, and CPU usage is very low on the ZK servers.

We are using CDH 5.3.1.

 

Thank you,

Michael.

1 ACCEPTED SOLUTION

There have been a number of issues in the RM relating to ZooKeeper connections. At least a couple of them are fixed in CDH 5.3.3 (YARN-3242, YARN-2992).

I am not sure whether your case is fully covered by these fixes, since we are still working on one or two more fixes in this area, but upgrading to CDH 5.3.3 will help with a number of these ZK issues in the RM.

Wilfred
