Created on 05-04-2015 02:05 AM - edited 09-16-2022 02:27 AM
Hi,
Recently we have been experiencing RM crashes, and we see the following error in the log:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
We also get a lot of these exceptions in the Resource Manager log:
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-04-30 22:53:51,669 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot/RMDelegationToken_57967
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:999)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:996)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:996)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeRMDelegationTokenState(ZKRMStateStore.java:737)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.removeRMDelegationToken(RMStateStore.java:668)
at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:142)
at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:49)
at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.removeExpiredToken(AbstractDelegationTokenSecretManager.java:605)
at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.access$400(AbstractDelegationTokenSecretManager.java:54)
at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:656)
at java.lang.Thread.run(Thread.java:745)
The network is fine in terms of errors and packet drops, and CPU usage is very low on the ZK servers.
We are using CDH 5.3.1.
Thank you,
Michael.
Created 05-13-2015 06:38 PM
There have been a number of issues in the RM related to ZooKeeper connections. At least a couple of them are fixed in CDH 5.3.3 (YARN-3242, YARN-2992).
I am not sure your case is fully covered by these fixes, since we are still working on one or two more fixes in this area, but upgrading to CDH 5.3.3 will help with a number of these ZK issues in the RM.
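Before and after the upgrade, it may help to check whether the ConnectionLoss errors cluster on particular znodes, such as the delegation token paths in the trace above. The following is only a minimal sketch for tallying them from RM log lines; the helper name `count_connection_loss` and the "(no path reported)" label are made up for illustration and are not part of any Hadoop or CDH tooling.

```python
import re
from collections import Counter

# Matches the znode path that follows "ConnectionLoss for " in a ZooKeeper
# exception line, as seen in the RM log excerpts in this thread.
_PATH_RE = re.compile(r"KeeperErrorCode = ConnectionLoss for (\S+)")


def count_connection_loss(lines):
    """Tally ConnectionLoss exceptions by parent znode (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if "KeeperErrorCode = ConnectionLoss" not in line:
            continue
        match = _PATH_RE.search(line)
        if match:
            # Group by parent znode so individual token ids collapse together.
            parent = match.group(1).rsplit("/", 1)[0]
        else:
            # Some exceptions (e.g. from multi operations) carry no path.
            parent = "(no path reported)"
        counts[parent] += 1
    return counts
```

To use it, pass an open log file, for example `count_connection_loss(open(path_to_rm_log))`, where `path_to_rm_log` is wherever your RM log lives, and print `counts.most_common()`. If nearly all hits fall under one parent znode, that points at a specific code path rather than a general connectivity problem.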
Wilfred