
YARN ResourceManager HA failed and all NodeManager services went down.

New Contributor

We have a YARN HA-enabled cluster with 17 data nodes. The active ResourceManager service went down and did not fail over to the standby, which eventually caused all 17 NodeManager services to fail. Both the active and standby ResourceManager services are now down. Attaching the relevant YARN logs from both the active and standby nodes, plus the log from one data node, which looks similar across all 17 data nodes. Please help out. Thank you.

Active ResourceManager YARN log:

2018-12-18 20:05:22,193 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1146)) - Session 0x2675d4e5b3b0046 for server prdhdpdn1.example.com/<MN1_IP>:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2018-12-18 20:05:22,266 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:onApplicationAttemptUnregistered(107)) - Application attempt appattempt_1541703830010_14756_000001 unregistered, purging app attempt records
2018-12-18 20:05:22,266 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:purgeRecordsAsync(198)) -  records under / with ID appattempt_1541703830010_14756_000001 and policy application-attempt: {}
2018-12-18 20:05:22,296 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1227)) - Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)


2018-12-18 20:05:22,296 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1230)) - Retrying operation on ZK. Retry no. 26
2018-12-18 20:05:22,339 INFO  ipc.Server (Server.java:logException(2401)) - IPC Server handler 43 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.106.8.107:40200 Call#39846 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1541703830010_14809' doesn't exist in RM.
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:331)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)


Standby ResourceManager YARN log:

2018-12-18 19:52:29,115 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1227)) - Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)

DataNode 1 - YARN NodeManager log:

2018-12-18 20:04:03,833 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(267)) - Exception while invoking ResourceTrackerPBClientImpl.nodeHeartbeat over rm2. Trying to failover immediately.
java.io.EOFException: End of File Exception between local host is: "prdhdpdn1.example.com/10.106.8.145"; destination host is: "prdhdpmn2.example.com":8031; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1556)
        at org.apache.hadoop.ipc.Client.call(Client.java:1496)
        at org.apache.hadoop.ipc.Client.call(Client.java:1396)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy87.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
        at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
        at com.sun.proxy.$Proxy88.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:701)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1117)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1012)
2018-12-18 20:04:03,834 INFO  client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm1
2018-12-18 20:04:03,835 WARN  ipc.Client (Client.java:handleConnectionFailure(886)) - Failed to connect to server: prdhdpmn1.example.com/<MN1_IP>:8031: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

DataNode 1 - ZooKeeper log:

2018-12-18 19:52:30,548 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /<DN1 IP>:57446 which had sessionid 0x1646a85ed0c0131
2018-12-18 19:52:33,665 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /<DN1 IP>:57464
2018-12-18 19:52:33,666 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x1646a85ed0c0131 at /<DN1 IP>:57464
2018-12-18 19:52:33,666 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x1646a85ed0c0131
2018-12-18 19:52:33,666 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x1646a85ed0c0131 with negotiated timeout 10000 for client /<DN1 IP>:57464
2018-12-18 19:52:33,828 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:SaslServerCallbackHandler@118] - Successfully authenticated client: authenticationID=rm/prdhdpmn2.example.com@example.com;  authorizationID=rm/prdhdpmn2.example.com@example.com.
2018-12-18 19:52:33,828 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:SaslServerCallbackHandler@134] - Setting authorizedID: rm
2018-12-18 19:52:33,828 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@964] - adding SASL authorization for authorizationID: rm
2018-12-18 19:52:33,828 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /<DN1 IP>:57464
2018-12-18 19:52:33,828 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /<DN1 IP>:57464
2018-12-18 19:52:33,830 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x1646a85ed0c0131 due to java.io.IOException: Len error 1111891
2018-12-18 19:52:33,830 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /<DN1 IP>:57464 which had sessionid 0x1646a85ed0c0131
2018-12-18 19:52:34,740 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /<DN1 IP>:57476
2018-12-18 19:52:34,741 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x1646a85ed0c0131 at /<DN1 IP>:57476
2018-12-18 19:52:34,741 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x1646a85ed0c0131
2018-12-18 19:52:34,741 - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x1646a85ed0c0131 with negotiated timeout 10000 for client /<DN1 IP>:57476


1 REPLY

Master Mentor

@Siva Kumar

From the YARN logs, the line "Session 0x2675d4e5b3b0046 for server prdhdpdn1.example.com/<MN1_IP>:2181, unexpected error" shows there is a problem with ZooKeeper.

Can you share the zookeeper.log under /var/log/* from the two nodes where the active and standby RMs are running? Also ensure that ntpd is running and the clocks are synchronized on all the servers.
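
While collecting those logs, a quick way to confirm that each ZooKeeper server is reachable and still responding is to send it the four-letter-word commands (ruok/stat) from one of the RM hosts. Below is a minimal Python sketch, assuming the default client port 2181, that the four-letter-word commands are enabled on your ZooKeeper version, and placeholder hostnames that you would replace with your actual quorum members:

import socket

# Placeholder quorum members -- replace with the actual ZooKeeper hosts in your cluster.
ZK_HOSTS = ["prdhdpdn1.example.com", "prdhdpmn1.example.com", "prdhdpmn2.example.com"]
ZK_PORT = 2181

def four_letter_word(host, port, cmd, timeout=5):
    """Send a ZooKeeper four-letter-word command (e.g. 'ruok', 'stat') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after answering a 4lw command
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

for host in ZK_HOSTS:
    try:
        # 'ruok' should come back as 'imok'; 'stat' shows the mode (leader/follower),
        # latency figures, and the number of client connections.
        print(host, "->", four_letter_word(host, ZK_PORT, "ruok"))
        print(four_letter_word(host, ZK_PORT, "stat"))
    except OSError as exc:
        print(host, "-> unreachable:", exc)

A server that answers ruok but shows very high latency or an unusually large connection count in stat would also point toward the quorum being under stress around the time the RMs lost their sessions.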

Please revert.