Created 12-20-2023 05:55 PM
Created 12-20-2023 10:11 PM
@vec Can you check your ZK logs? You should find the actual error there. It looks like ZK is rejecting the RM connection.
Created on 12-21-2023 04:15 PM - edited 12-21-2023 04:16 PM
I deployed 3 ZooKeeper nodes and they are running well, and the ZK logs don't print any errors. After I stopped all ZK nodes, the RM log printed the errors below:
2023-12-22 07:34:11,542 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server cdp3.oia.com/192.168.1.176:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2023-12-22 07:34:11,542 INFO org.apache.zookeeper.ClientCnxn: Socket error occurred: cdp3.oia.com/192.168.1.176:2181: Connection refused
2023-12-22 07:34:11,642 WARN org.apache.zookeeper.Login: TGT renewal thread has been interrupted and will exit.
2023-12-22 07:34:11,645 INFO org.apache.zookeeper.Login: Client successfully logged in.
2023-12-22 07:34:11,645 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2023-12-22 07:34:11,645 INFO org.apache.zookeeper.Login: TGT refresh thread started.
2023-12-22 07:34:11,645 INFO org.apache.zookeeper.Login: TGT valid starting at: Fri Dec 22 07:34:10 CST 2023
2023-12-22 07:34:11,645 INFO org.apache.zookeeper.Login: TGT expires: Sat Dec 23 07:34:10 CST 2023
2023-12-22 07:34:11,645 INFO org.apache.zookeeper.Login: TGT refresh sleeping until: Sat Dec 23 03:31:02 CST 2023
2023-12-22 07:34:11,645 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server cdp1.oia.com/192.168.1.205:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2023-12-22 07:34:11,645 INFO org.apache.zookeeper.ClientCnxn: Socket error occurred: cdp1.oia.com/192.168.1.205:2181: Connection refused
2023-12-22 07:34:12,746 WARN org.apache.zookeeper.Login: TGT renewal thread has been interrupted and will exit.
2023-12-22 07:34:12,748 INFO org.apache.zookeeper.Login: Client successfully logged in.
2023-12-22 07:34:12,749 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2023-12-22 07:34:12,749 INFO org.apache.zookeeper.Login: TGT refresh thread started.
2023-12-22 07:34:12,749 INFO org.apache.zookeeper.Login: TGT valid starting at: Fri Dec 22 07:34:11 CST 2023
2023-12-22 07:34:12,749 INFO org.apache.zookeeper.Login: TGT expires: Sat Dec 23 07:34:11 CST 2023
2023-12-22 07:34:12,749 INFO org.apache.zookeeper.Login: TGT refresh sleeping until: Sat Dec 23 03:33:14 CST 2023
2023-12-22 07:34:12,749 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server cdp2.oia.com/192.168.1.169:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2023-12-22 07:34:12,749 INFO org.apache.zookeeper.ClientCnxn: Socket error occurred: cdp2.oia.com/192.168.1.169:2181: Connection refused
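The "Connection refused" lines above are what you would expect while every ZK node is stopped. For anyone who wants to confirm that from the RM host, here is a minimal sketch that only tests whether the client port accepts TCP connections. The helper name `zk_port_open` is made up for illustration; the hostnames and port 2181 are taken from the log. It does not exercise SASL/Kerberos or ZK ACLs at all.

```python
import socket


def zk_port_open(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical helper: it only checks reachability of the ZK client port,
    not authentication or authorization.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Hostnames as they appear in the RM log above.
    for host in ("cdp1.oia.com", "cdp2.oia.com", "cdp3.oia.com"):
        state = "open" if zk_port_open(host) else "refused/unreachable"
        print(f"{host}:2181 {state}")
```

If all three report open and the RM still fails, the problem is past the network layer, which matches the second log excerpt.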
After I resumed the ZK nodes, the connection appears to be established, but the RM throws an exception at the end:
2023-12-22 07:34:33,729 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server cdp2.oia.com/192.168.1.169:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2023-12-22 07:34:33,730 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.1.205:56000, server: cdp2.oia.com/192.168.1.169:2181
2023-12-22 07:34:33,757 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server cdp2.oia.com/192.168.1.169:2181, sessionid = 0x2002158fc320000, negotiated timeout = 40000
2023-12-22 07:34:33,758 INFO org.apache.curator.framework.state.ConnectionStateManager: State change: CONNECTED
2023-12-22 07:34:33,781 INFO org.apache.curator.framework.imps.EnsembleTracker: New config event received: {server.1=cdp1.oia.com:3181:4181:participant, version=0, server.3=cdp3.oia.com:3181:4181:participant, server.2=cdp2.oia.com:3181:4181:participant}
2023-12-22 07:34:33,784 INFO org.apache.curator.framework.imps.EnsembleTracker: New config event received: {server.1=cdp1.oia.com:3181:4181:participant, version=0, server.3=cdp3.oia.com:3181:4181:participant, server.2=cdp2.oia.com:3181:4181:participant}
2023-12-22 07:34:33,795 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /confstore/CONF_STORE
at org.apache.zookeeper.KeeperException.create(KeeperException.java:120)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:1793)
at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:274)
at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:268)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81)
at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:265)
at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:249)
at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:34)
at org.apache.hadoop.util.curator.ZKCuratorManager.delete(ZKCuratorManager.java:331)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.format(ZKConfigurationStore.java:148)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.deleteRMConfStore(ResourceManager.java:1658)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1534)
2023-12-22 07:34:33,803 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at cdp1.oia.com/192.168.1.205
************************************************************/
The workaround I used was to delete ZK, YARN Queue Manager, and YARN, and then redeploy only YARN. So far it looks good.
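For anyone hitting the same trace: the session itself was established, and the failure is the `NoAuth` returned for the delete on `/confstore/CONF_STORE` (issued from `ZKConfigurationStore.format` according to the stack). As far as I know, ZooKeeper checks the DELETE permission against the ACL of the parent znode, so the ACLs on both the znode and its parent are worth inspecting (for example with `getAcl` in `zkCli`). Below is a small sketch for pulling the failing znode and its parent out of an RM log. The function names are hypothetical and the regex is based only on the message format shown above.

```python
import re

# Matches lines such as:
# org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /confstore/CONF_STORE
KEEPER_RE = re.compile(
    r"KeeperException\$(?P<exc>\w+): KeeperErrorCode = (?P<code>\w+) for (?P<path>\S+)"
)


def find_keeper_errors(log_text: str) -> list[tuple[str, str, str]]:
    """Return (exception class, error code, znode path) for each match."""
    return [(m["exc"], m["code"], m["path"]) for m in KEEPER_RE.finditer(log_text)]


def parent_znode(path: str) -> str:
    """Return the parent znode path; the parent of a top-level znode is '/'."""
    head, _, _ = path.rstrip("/").rpartition("/")
    return head or "/"


if __name__ == "__main__":
    sample = (
        "org.apache.zookeeper.KeeperException$NoAuthException: "
        "KeeperErrorCode = NoAuth for /confstore/CONF_STORE"
    )
    for exc, code, path in find_keeper_errors(sample):
        print(f"{code} on {path}; also check the ACL of {parent_znode(path)}")
```

This only narrows down which znodes to look at. It does not say which principal owns them or why the ACL no longer matches the RM's identity.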