YARN (ResourceManager) stopping automatically

Hi Team,

The ResourceManager service stops automatically within a few seconds of starting.

I have not found any errors or exceptions in the ResourceManager logs. I suspect there is an issue with ZooKeeper. I have three ZooKeeper services. Below are the logs of the ResourceManager and two of the ZooKeeper services.

Resource Manager Logs:

2018-05-30 09:08:38,058 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.io.tmpdir=/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir
2018-05-30 09:08:38,058 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.compiler=<NA>
2018-05-30 09:08:38,058 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.name=Linux
2018-05-30 09:08:38,058 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.arch=amd64
2018-05-30 09:08:38,059 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.version=3.10.0-693.el7.x86_64
2018-05-30 09:08:38,059 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.name=yarn
2018-05-30 09:08:38,059 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.home=/home/yarn
2018-05-30 09:08:38,059 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/usr/hdp/2.6.3.0-235/hadoop-yarn
2018-05-30 09:08:38,060 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=hdp01.mydomain.com:2181,hdp03.mydomain.com:2181,hdp02.mydomain.com:2181 sessionTimeout=10000 watcher=null
2018-05-30 09:08:38,179 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:createConnection(1276)) - Created new ZK connection
2018-05-30 09:08:38,200 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server hdp02.mydomain.com/192.168.3.19:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-30 09:08:38,220 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established, initiating session, client: /192.168.3.18:56340, server: hdp02.mydomain.com/192.168.3.19:2181
2018-05-30 09:08:38,279 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server hdp02.mydomain.com/192.168.3.19:2181, sessionid = 0x263b11a44120001, negotiated timeout = 10000
2018-05-30 09:08:38,495 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:run(359)) - Fencing node /rmstore/ZKRMStateRoot/RM_ZK_FENCING_LOCK doesn't exist to delete
2018-05-30 09:08:38,793 INFO  resourcemanager.ResourceManager (ResourceManager.java:serviceStart(597)) - Recovery started
2018-05-30 09:08:38,851 INFO  recovery.RMStateStore (RMStateStore.java:checkVersion(639)) - Loaded RM state version info 1.2

Zookeeper 1 logs:

2018-05-30 09:08:38,208 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.3.18:56340
2018-05-30 09:08:38,246 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /192.168.3.18:56340
2018-05-30 09:08:38,272 - INFO  [CommitProcessor:2:ZooKeeperServer@617] - Established session 0x263b11a44120001 with negotiated timeout 10000 for client /192.168.3.18:56340
2018-05-30 09:08:38,285 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x1 zxid:0x800000042 txntype:-1 reqpath:n/a Error Path:/rmstore Error:KeeperErrorCode = NodeExists for /rmstore
2018-05-30 09:08:38,344 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x2 zxid:0x800000043 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot
2018-05-30 09:08:38,447 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@590] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:multi cxid:0x4 zxid:0x800000045 txntype:-1 reqpath:n/a aborting remaining multi ops. Error Path:/rmstore/ZKRMStateRoot/RM_ZK_FENCING_LOCK Error:KeeperErrorCode = NoNode for /rmstore/ZKRMStateRoot/RM_ZK_FENCING_LOCK
2018-05-30 09:08:38,510 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x5 zxid:0x800000046 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMAppRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMAppRoot
2018-05-30 09:08:38,535 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x6 zxid:0x800000047 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot
2018-05-30 09:08:38,602 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x7 zxid:0x800000048 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot
2018-05-30 09:08:38,666 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x8 zxid:0x800000049 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot
2018-05-30 09:08:38,724 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x9 zxid:0x80000004a txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTSequentialNumber Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTSequentialNumber
2018-05-30 09:08:38,765 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0xa zxid:0x80000004b txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot
2018-05-30 09:08:45,736 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.3.19:38248
2018-05-30 09:08:45,736 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /192.168.3.19:38248
2018-05-30 09:08:45,767 - INFO  [Thread-35:NIOServerCnxn@1008] - Closed socket connection for client /192.168.3.19:38248 (no session established for client)
2018-05-30 09:09:02,037 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x263b11a44120001 due to java.io.IOException: Connection reset by peer
2018-05-30 09:09:02,038 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /192.168.3.18:56340 which had sessionid 0x263b11a44120001
2018-05-30 09:09:12,009 - INFO  [SessionTracker:ZooKeeperServer@347] - Expiring session 0x263b11a44120001, timeout of 10000ms exceeded
2018-05-30 09:09:12,017 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@492] - Processed session termination for sessionid: 0x263b11a44120001
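
One way to read the ZooKeeper log above is to follow the ResourceManager's session id (0x263b11a44120001) from creation to expiry. The sketch below is only an illustration: `session_timeline` and `SAMPLE_LINES` are hypothetical names, and the sample lines are copied from the log above. It filters lines by session id and prints a short timeline. In this log the sequence is "Established session", then "Connection reset by peer" about 24 seconds later, then expiry, which may suggest that the client side (the ResourceManager) dropped the connection rather than ZooKeeper, although the logs alone do not confirm the cause.

```python
import re

# Matches the timestamp at the start of a ZooKeeper server log line.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

# Lines copied from the ZooKeeper log in this thread (one unrelated ruok line included).
SAMPLE_LINES = [
    "2018-05-30 09:08:38,272 - INFO  [CommitProcessor:2:ZooKeeperServer@617] - Established session 0x263b11a44120001 with negotiated timeout 10000 for client /192.168.3.18:56340",
    "2018-05-30 09:08:45,736 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /192.168.3.19:38248",
    "2018-05-30 09:09:02,037 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x263b11a44120001 due to java.io.IOException: Connection reset by peer",
    "2018-05-30 09:09:12,009 - INFO  [SessionTracker:ZooKeeperServer@347] - Expiring session 0x263b11a44120001, timeout of 10000ms exceeded",
]


def session_timeline(log_lines, session_id):
    """Return (timestamp, message) pairs for log lines that mention session_id."""
    events = []
    for line in log_lines:
        if session_id not in line:
            continue
        match = TS.match(line)
        if not match:
            continue
        # The message is everything after the "[thread:Class@line] - " prefix.
        message = line.split("] - ", 1)[-1].strip()
        events.append((match.group(1), message))
    return events


if __name__ == "__main__":
    for timestamp, message in session_timeline(SAMPLE_LINES, "0x263b11a44120001"):
        print(timestamp, "|", message)
```

To run it against a real log, read the file's lines and pass them in place of `SAMPLE_LINES`; the log file location depends on the installation.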

Zookeeper 2 logs:

2018-05-30 09:08:46,033 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.3.18:33482
2018-05-30 09:08:46,033 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /192.168.3.18:33482
2018-05-30 09:08:46,083 - INFO  [Thread-20:NIOServerCnxn@1008] - Closed socket connection for client /192.168.3.18:33482 (no session established for client)
2018-05-30 09:09:46,017 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.3.18:33584
2018-05-30 09:09:46,018 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /192.168.3.18:33584
2018-05-30 09:09:46,052 - INFO  [Thread-21:NIOServerCnxn@1008] - Closed socket connection for client /192.168.3.18:33584 (no session established for client)

Please help with this. Thanks in advance.

-Paramesh.

Accepted solution:

It was resolved once I moved the ResourceManager to another node in the cluster.

Replies:

Hey @Paramesh malla!

Could you check whether yarn.resourcemanager.recovery.enabled is set to true?
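
If it helps, here is a minimal sketch for reading that property out of a Hadoop-style configuration file. The helper name `get_hadoop_property` and the `SAMPLE` excerpt are hypothetical and only for illustration; your real yarn-site.xml will contain many more properties and may carry different values.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a yarn-site.xml file, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
</configuration>"""


def get_hadoop_property(xml_text, name):
    """Return the value of a <property> entry by name, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


if __name__ == "__main__":
    print(get_hadoop_property(SAMPLE, "yarn.resourcemanager.recovery.enabled"))
```

To check a real cluster, pass the contents of your yarn-site.xml instead of `SAMPLE`. On many installations the file sits under /etc/hadoop/conf, but that path is an assumption; in an Ambari-managed cluster the same setting can also be viewed in the YARN configuration pages.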

Hi Vinicius, yes it is already enabled. I am still having this issue.

Can you please explain the solution in detail?