Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

ResourceManager crashes due to KeeperErrorCode = ConnectionLoss

SOLVED Go to solution

ResourceManager crashes due to KeeperErrorCode = ConnectionLoss

New Contributor

Hi,

Recently we are experiencing RM crashes and we see the following error in the log:

 

Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)

 

 

We also get a lot of these exceptions in the Resource Manager log:

java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-04-30 22:53:51,669 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot/RMDelegationToken_57967
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:999)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:996)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:996)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeRMDelegationTokenState(ZKRMStateStore.java:737)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.removeRMDelegationToken(RMStateStore.java:668)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:142)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:49)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.removeExpiredToken(AbstractDelegationTokenSecretManager.java:605)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.access$400(AbstractDelegationTokenSecretManager.java:54)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:656)
        at java.lang.Thread.run(Thread.java:745)

 

Network is fine it terms of errors/packet drops CPU usage is very low on ZK servers.

We are using CDH 5.3.1.

 

Thank you,

Michael.

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: ResourceManager crashes due to KeeperErrorCode = ConnectionLoss

Super Collaborator

There have been a number of issues in the RM with relation to ZooKeeper connections. There is at least a couple of issue fixed in CDH 5.3.3 (YARN-3242, YARN-2992).

I am not sure if your case is fully covered by these fixes since we are still working on one or two fixes in this area but upgrading to CDH 5.3.3 will help with a number of these ZK issues in the RM.

 

Wilfred

1 REPLY 1
Highlighted

Re: ResourceManager crashes due to KeeperErrorCode = ConnectionLoss

Super Collaborator

There have been a number of issues in the RM with relation to ZooKeeper connections. There is at least a couple of issue fixed in CDH 5.3.3 (YARN-3242, YARN-2992).

I am not sure if your case is fully covered by these fixes since we are still working on one or two fixes in this area but upgrading to CDH 5.3.3 will help with a number of these ZK issues in the RM.

 

Wilfred