Reply
New Contributor
Posts: 5
Registered: ‎08-18-2014
Accepted Solution

ResourceManager crashes due to KeeperErrorCode = ConnectionLoss

[ Edited ]

Hi,

Recently we are experiencing RM crashes and we see the following error in the log:

 

Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)

 

 

We also get a lot of these exceptions in the Resource Manager log:

java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-04-30 22:53:51,669 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot/RMDelegationToken_57967
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:999)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:996)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:996)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeRMDelegationTokenState(ZKRMStateStore.java:737)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.removeRMDelegationToken(RMStateStore.java:668)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:142)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:49)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.removeExpiredToken(AbstractDelegationTokenSecretManager.java:605)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.access$400(AbstractDelegationTokenSecretManager.java:54)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:656)
        at java.lang.Thread.run(Thread.java:745)

 

Network is fine it terms of errors/packet drops CPU usage is very low on ZK servers.

We are using CDH 5.3.1.

 

Thank you,

Michael.

Cloudera Employee
Posts: 314
Registered: ‎01-16-2014

Re: ResourceManager crashes due to KeeperErrorCode = ConnectionLoss

There have been a number of issues in the RM with relation to ZooKeeper connections. There is at least a couple of issue fixed in CDH 5.3.3 (YARN-3242, YARN-2992).

I am not sure if your case is fully covered by these fixes since we are still working on one or two fixes in this area but upgrading to CDH 5.3.3 will help with a number of these ZK issues in the RM.

 

Wilfred

Announcements