Yarn (Resource manager) stopping automatically

Contributor

Hi Team,

The ResourceManager service stops automatically within a few seconds.

I have not found any errors or exceptions in the ResourceManager logs. I suspect there is an issue with ZooKeeper. I have three ZooKeeper servers. Below are the logs from the ResourceManager and from two of the ZooKeeper servers.

ResourceManager logs:

2018-05-30 09:08:38,058 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.io.tmpdir=/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir
2018-05-30 09:08:38,058 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.compiler=<NA>
2018-05-30 09:08:38,058 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.name=Linux
2018-05-30 09:08:38,058 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.arch=amd64
2018-05-30 09:08:38,059 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.version=3.10.0-693.el7.x86_64
2018-05-30 09:08:38,059 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.name=yarn
2018-05-30 09:08:38,059 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.home=/home/yarn
2018-05-30 09:08:38,059 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/usr/hdp/2.6.3.0-235/hadoop-yarn
2018-05-30 09:08:38,060 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=hdp01.mydomain.com:2181,hdp03.mydomain.com:2181,hdp02.mydomain.com:2181 sessionTimeout=10000 watcher=null
2018-05-30 09:08:38,179 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:createConnection(1276)) - Created new ZK connection
2018-05-30 09:08:38,200 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server hdp02.mydomain.com/192.168.3.19:2181. Will not attempt to authenticate using SASL (unknown error)
2018-05-30 09:08:38,220 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(864)) - Socket connection established, initiating session, client: /192.168.3.18:56340, server: hdp02.mydomain.com/192.168.3.19:2181
2018-05-30 09:08:38,279 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1279)) - Session establishment complete on server hdp02.mydomain.com/192.168.3.19:2181, sessionid = 0x263b11a44120001, negotiated timeout = 10000
2018-05-30 09:08:38,495 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:run(359)) - Fencing node /rmstore/ZKRMStateRoot/RM_ZK_FENCING_LOCK doesn't exist to delete
2018-05-30 09:08:38,793 INFO  resourcemanager.ResourceManager (ResourceManager.java:serviceStart(597)) - Recovery started
2018-05-30 09:08:38,851 INFO  recovery.RMStateStore (RMStateStore.java:checkVersion(639)) - Loaded RM state version info 1.2

ZooKeeper 1 logs:

2018-05-30 09:08:38,208 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.3.18:56340
2018-05-30 09:08:38,246 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /192.168.3.18:56340
2018-05-30 09:08:38,272 - INFO  [CommitProcessor:2:ZooKeeperServer@617] - Established session 0x263b11a44120001 with negotiated timeout 10000 for client /192.168.3.18:56340
2018-05-30 09:08:38,285 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x1 zxid:0x800000042 txntype:-1 reqpath:n/a Error Path:/rmstore Error:KeeperErrorCode = NodeExists for /rmstore
2018-05-30 09:08:38,344 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x2 zxid:0x800000043 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot
2018-05-30 09:08:38,447 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@590] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:multi cxid:0x4 zxid:0x800000045 txntype:-1 reqpath:n/a aborting remaining multi ops. Error Path:/rmstore/ZKRMStateRoot/RM_ZK_FENCING_LOCK Error:KeeperErrorCode = NoNode for /rmstore/ZKRMStateRoot/RM_ZK_FENCING_LOCK
2018-05-30 09:08:38,510 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x5 zxid:0x800000046 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMAppRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMAppRoot
2018-05-30 09:08:38,535 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x6 zxid:0x800000047 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot
2018-05-30 09:08:38,602 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x7 zxid:0x800000048 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot
2018-05-30 09:08:38,666 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x8 zxid:0x800000049 txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDelegationTokensRoot
2018-05-30 09:08:38,724 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0x9 zxid:0x80000004a txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTSequentialNumber Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTSequentialNumber
2018-05-30 09:08:38,765 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@643] - Got user-level KeeperException when processing sessionid:0x263b11a44120001 type:create cxid:0xa zxid:0x80000004b txntype:-1 reqpath:n/a Error Path:/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot Error:KeeperErrorCode = NodeExists for /rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot
2018-05-30 09:08:45,736 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.3.19:38248
2018-05-30 09:08:45,736 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /192.168.3.19:38248
2018-05-30 09:08:45,767 - INFO  [Thread-35:NIOServerCnxn@1008] - Closed socket connection for client /192.168.3.19:38248 (no session established for client)
2018-05-30 09:09:02,037 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x263b11a44120001 due to java.io.IOException: Connection reset by peer
2018-05-30 09:09:02,038 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /192.168.3.18:56340 which had sessionid 0x263b11a44120001
2018-05-30 09:09:12,009 - INFO  [SessionTracker:ZooKeeperServer@347] - Expiring session 0x263b11a44120001, timeout of 10000ms exceeded
2018-05-30 09:09:12,017 - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@492] - Processed session termination for sessionid: 0x263b11a44120001
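
Read in order, the entries for session 0x263b11a44120001 show the session being established, then the connection from 192.168.3.18 (the ResourceManager client address in the log above) being reset, and only afterwards the session expiring. That ordering would be consistent with the client process going away rather than ZooKeeper dropping it, though the logs alone do not prove this. Below is a minimal Python sketch for pulling such a timeline out of a ZooKeeper server log. The function name `session_events` and the regular expression are illustrative assumptions, not part of any ZooKeeper tooling; the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2018-05-30 09:09:02,037 - WARN  [thread:Class@362] - message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (\w+)\s+\[.*?\] - (.*)$"
)


def session_events(lines, session_id):
    """Return (timestamp, level, message) for every line mentioning session_id."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None or session_id not in match.group(3):
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        events.append((stamp, match.group(2), match.group(3)))
    return events


SAMPLE_LOG = [
    "2018-05-30 09:08:38,272 - INFO  [CommitProcessor:2:ZooKeeperServer@617] - Established session 0x263b11a44120001 with negotiated timeout 10000 for client /192.168.3.18:56340",
    "2018-05-30 09:08:45,736 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /192.168.3.19:38248",
    "2018-05-30 09:09:02,037 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x263b11a44120001 due to java.io.IOException: Connection reset by peer",
    "2018-05-30 09:09:12,009 - INFO  [SessionTracker:ZooKeeperServer@347] - Expiring session 0x263b11a44120001, timeout of 10000ms exceeded",
]

events = session_events(SAMPLE_LOG, "0x263b11a44120001")
first = events[0][0]
for stamp, level, message in events:
    # Print the offset in seconds from the first event for this session.
    offset = (stamp - first).total_seconds()
    print(f"+{offset:7.3f}s {level:5s} {message[:60]}")
```

The gap between the reset and the expiry is close to the negotiated 10000 ms timeout, which is what one would expect if the server simply waited out the session after the client disconnected.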

ZooKeeper 2 logs:

2018-05-30 09:08:46,033 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.3.18:33482
2018-05-30 09:08:46,033 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /192.168.3.18:33482
2018-05-30 09:08:46,083 - INFO  [Thread-20:NIOServerCnxn@1008] - Closed socket connection for client /192.168.3.18:33482 (no session established for client)
2018-05-30 09:09:46,017 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.3.18:33584
2018-05-30 09:09:46,018 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /192.168.3.18:33584
2018-05-30 09:09:46,052 - INFO  [Thread-21:NIOServerCnxn@1008] - Closed socket connection for client /192.168.3.18:33584 (no session established for client)
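
The `ruok` entries in both ZooKeeper logs are health probes: a client sends the four-letter command `ruok` and a server that is running replies `imok`. The following is a minimal Python sketch for running the same probe by hand against every server in the connect string. `zk_ruok` and `check_ensemble` are hypothetical helper names, and the hostnames in the usage note are the ones from the ResourceManager log above. On some newer ZooKeeper releases four-letter commands may need to be whitelisted before the server answers them.

```python
import socket


def zk_ruok(host, port=2181, timeout=5.0):
    """Send 'ruok' to a ZooKeeper server and return True if it answers 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # The server closes the connection after answering, so read
            # until EOF or until the four expected bytes have arrived.
            while len(reply) < 4:
                chunk = sock.recv(16)
                if not chunk:
                    break
                reply += chunk
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
    return reply == b"imok"


def check_ensemble(connect_string):
    """Probe every 'host:port' entry of a comma-separated connect string."""
    results = {}
    for entry in connect_string.split(","):
        host, _, port = entry.strip().rpartition(":")
        results[entry.strip()] = zk_ruok(host, int(port))
    return results
```

For this cluster the call would be `check_ensemble("hdp01.mydomain.com:2181,hdp03.mydomain.com:2181,hdp02.mydomain.com:2181")`. A `True` for all three servers only shows that each one is up and answering; it does not rule out other problems.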

Please help. Thanks in advance.

-Paramesh.

1 ACCEPTED SOLUTION

Contributor

It was resolved after moving the RM to another node in the cluster.


4 REPLIES


Hey @Paramesh malla!

Could you check whether yarn.resourcemanager.recovery.enabled is set to true?
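
One way to check this outside the management UI is to read the property straight from `yarn-site.xml`. The sketch below assumes the usual Hadoop configuration layout (`<configuration>` containing `<property>` elements with `<name>` and `<value>`). The helper name `get_hadoop_property` is hypothetical, and `SAMPLE` is a made-up stand-in for the real file, whose location depends on the installation (often `/etc/hadoop/conf/yarn-site.xml`).

```python
import xml.etree.ElementTree as ET


def get_hadoop_property(xml_text, name):
    """Return the value of a property in Hadoop-style config XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Illustrative content only; replace with the text of the real yarn-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
</configuration>"""

recovery = get_hadoop_property(SAMPLE, "yarn.resourcemanager.recovery.enabled")
print("recovery enabled:", recovery)
```

To run it against a live node, read the file into a string and pass that in place of `SAMPLE`.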

Contributor

Hi Vinicius, yes, it is already enabled. I am still having this issue.

Contributor

It was resolved after moving the RM to another node in the cluster.


Can you please explain the solution in detail?