PL
New Contributor
Posts: 3
Registered: ‎11-14-2013

CDH 5 YARN Resource Manager HA - deadlock in Kerberos cluster

2014-06-27 17:07:31,749 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=cblrv9032l.sunlab.ca:2181,cblrv9031l.sunlab.ca:2181,cblrv9033l.sunlab.ca:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@503b8f3f
2014-06-27 17:07:31,750 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server cblrv9032l.sunlab.ca/10.126.180.32:2181. Will not attempt to authenticate using SASL (unknown error)
2014-06-27 17:07:31,751 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to cblrv9032l.sunlab.ca/10.126.180.32:2181, initiating session
2014-06-27 17:07:31,755 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server cblrv9032l.sunlab.ca/10.126.180.32:2181, sessionid = 0x146ded547c3078a, negotiated timeout = 10000
2014-06-27 17:07:31,755 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-06-27 17:07:31,755 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2014-06-27 17:07:31,758 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2014-06-27 17:07:31,758 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a067961726e524d1205726d313739
2014-06-27 17:07:31,759 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /yarn-leader-election/yarnRM/ActiveBreadCrumb to indicate that the local node is the most recent active...
2014-06-27 17:07:31,761 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/var/run/cloudera-scm-agent/process/756-yarn-RESOURCEMANAGER/yarn-site.xml
2014-06-27 17:07:31,765 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=SUCCESS
2014-06-27 17:07:31,765 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2014-06-27 17:07:31,765 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=transitionToActive TARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed
2014-06-27 17:07:31,765 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
 at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
 at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
 at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
 at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
 at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
 ... 4 more
Caused by: org.apache.hadoop.service.ServiceStateException: RMActiveServices cannot enter state STARTED from state STOPPED
 at org.apache.hadoop.service.ServiceStateModel.checkStateTransition(ServiceStateModel.java:129)
 at org.apache.hadoop.service.ServiceStateModel.enterState(ServiceStateModel.java:111)
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:190)
 at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
 at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
 at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
 ... 5 more
2014-06-27 17:07:31,766 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2014-06-27 17:07:31,768 INFO org.apache.zookeeper.ZooKeeper: Session: 0x146ded547c3078a closed

New Contributor
Posts: 5
Registered: ‎08-13-2014

Re: CDH 5 YARN Resource Manager HA - deadlock in Kerberos cluster

Does this occur during restoration of the jobs? A client had this same issue after their AD service accounts expired over the weekend. The authentication tokens had expired, and restarting a job with the same tokens would fail. Look for other stack traces in the log and see if you can find the job that is failing.

 

Assuming that this is the issue, you can remove the failing application from ZooKeeper by using zookeeper-client. You'll need to add the YARN authorization to your ZooKeeper session, which you can find in the newest /var/run/cloudera-scm-agent/process/#####-yarn-RESOURCEMANAGER/yarn-site.xml file under the yarn.resourcemanager.zk-auth property.
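
If you'd rather not read through the XML by hand, here is a rough sketch of pulling that property out with awk. The function name get_hadoop_property is made up for illustration (it is not a Cloudera or Hadoop tool), and it assumes each <name> and <value> element sits on a single line, with the <value> following its <name>.

```shell
#!/bin/sh
# Sketch only: print the <value> of a named property from a Hadoop-style
# XML config file. get_hadoop_property is a hypothetical helper name.
# Assumes <name>...</name> and <value>...</value> each fit on one line
# and that the <value> comes after its <name>.
get_hadoop_property() {
  key="$1"
  file="$2"
  awk -v key="$key" '
    # Remember that we have seen the property name we are looking for.
    index($0, "<name>" key "</name>") { found = 1 }
    # Print the text of the first <value> element at or after that name.
    found && match($0, /<value>[^<]*<\/value>/) {
      # "<value>" is 7 characters, "</value>" is 8, so strip 15 in total.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$file"
}

# Example call (the ##### process directory number differs per restart):
# get_hadoop_property yarn.resourcemanager.zk-auth \
#   "/var/run/cloudera-scm-agent/process/#####-yarn-RESOURCEMANAGER/yarn-site.xml"
```

Whatever it prints is the stored auth entry; compare it against what addauth expects (scheme first, then the credentials) before pasting it into the ZooKeeper shell.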

 

Once you have the auth and the problem application, you can remove it from ZooKeeper:

 

zookeeper-client -server zookeeperMasterServer1:port

addauth digest yarn:yourReallyLongAuthStringHere

rmr /rmstore/ZKRMStateRoot/RMAppRoot/application_#############_####

 

Now restart your RM and you should be all set.

Brian

New Contributor
Posts: 4
Registered: ‎02-09-2015

Re: CDH 5 YARN Resource Manager HA - deadlock in Kerberos cluster

I ran into this today, and the cleanup Brian described fixed it.

Cloudera Employee
Posts: 224
Registered: ‎09-23-2013

Re: CDH 5 YARN Resource Manager HA - deadlock in Kerberos cluster

The trigger for this behavior (depending on release levels) is that if YARN jobs were run before Kerberos was enabled, the RM will not start once Kerberos is enabled.

 

The cleanup described by Brian is the workaround for this behavior.

New Contributor
Posts: 1
Registered: ‎03-02-2015

Re: CDH 5 YARN Resource Manager HA - deadlock in Kerberos cluster

Is this going to be patched? I'd think a failure to restart an application shouldn't prevent the entire system from recovering, correct? It would be nice if this were a recoverable error, or at minimum if the behavior could be overridden by an admin to help clean out broken nodes.
