06-27-2014 02:08 PM
2014-06-27 17:07:31,749 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=cblrv9032l.sunlab.ca:2181,cblrv9031l
2014-06-27 17:07:31,750 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server cblrv9032l.sunlab.ca/10.126.180.32:2181. Will not attempt to authenticate using SASL (unknown error)
2014-06-27 17:07:31,751 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to cblrv9032l.sunlab.ca/10.126.180.32:2181, initiating session
2014-06-27 17:07:31,755 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server cblrv9032l.sunlab.ca/10.126.180.32:2181, sessionid = 0x146ded547c3078a, negotiated timeout = 10000
2014-06-27 17:07:31,755 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-06-27 17:07:31,755 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2014-06-27 17:07:31,758 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2014-06-27 17:07:31,758 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a067961726e524d1205726d313739
2014-06-27 17:07:31,759 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /yarn-leader-election/yarnRM/ActiveBreadCrumb to indicate that the local node is the most recent active...
2014-06-27 17:07:31,761 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/var/run/cloudera-scm-agent/process/756-yarn-
2014-06-27 17:07:31,765 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAu
2014-06-27 17:07:31,765 INFO org.apache.hadoop.yarn.server.resourcemanager.Reso
2014-06-27 17:07:31,765 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAu
2014-06-27 17:07:31,765 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
... 4 more
Caused by: org.apache.hadoop.service.ServiceStateException: RMActiveServices cannot enter state STARTED from state STOPPED
... 5 more
2014-06-27 17:07:31,766 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2014-06-27 17:07:31,768 INFO org.apache.zookeeper.ZooKeeper: Session: 0x146ded547c3078a closed
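The hex string in the "Old node exists" line above can be decoded to see which RM the breadcrumb points at. This is a minimal sketch, assuming the znode payload is a protobuf-style message made only of short length-delimited string fields; the function name is hypothetical, and it is not the Hadoop deserializer.

```python
def decode_length_delimited(hex_payload):
    """Decode a byte string made of short protobuf-style length-delimited fields.

    Assumption: every field uses wire type 2 with a single-byte tag and a
    single-byte length. Returns a list of (field_number, text) pairs.
    """
    data = bytes.fromhex(hex_payload)
    fields = []
    pos = 0
    while pos < len(data):
        if pos + 1 >= len(data):
            raise ValueError(f"truncated field header at offset {pos}")
        tag = data[pos]
        length = data[pos + 1]
        if tag >= 0x80 or length >= 0x80:
            raise ValueError(f"multi-byte tag or length at offset {pos} is not handled")
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 2:
            raise ValueError(f"unsupported wire type {wire_type} at offset {pos}")
        start = pos + 2
        end = start + length
        if end > len(data):
            raise ValueError(f"field at offset {pos} runs past the end of the payload")
        fields.append((field_number, data[start:end].decode("utf-8")))
        pos = end
    return fields


# Payload copied from the "Old node exists" log line.
print(decode_length_delimited("0a067961726e524d1205726d313739"))
# → [(1, 'yarnRM'), (2, 'rm179')]
```

Under that assumption, the first field matches the yarnRM component of the /yarn-leader-election/yarnRM path in the log, and the second looks like an RM id.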
08-13-2014 07:13 AM
Does this occur during restoration of the jobs? A client had this same issue after their AD service accounts expired over the weekend. The authentication tokens had expired, and when the RM tried to restart a job using those same tokens, it failed. Look for other stack traces and see if you can find the job that is failing.
Assuming that this is the issue, you can remove the job from ZooKeeper using zookeeper-client. You'll first need to add the YARN authorization to your ZooKeeper session; you can find it in the newest /var/run/cloudera-scm-agent/process/#####-yarn-RES
Once you have the auth and the problem application, you can remove the application from ZooKeeper:
zookeeper-client -server zookeeperMasterServer1:port
addauth digest yarn:yourReallyLongAuthStringHere
Now restart your RM and you should be all set.
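The cleanup above can be sketched as a small helper that prints the lines to type into the zookeeper-client shell. This is only a sketch under stated assumptions: the function name and the application ID are hypothetical, the znode root /rmstore/ZKRMStateRoot/RMAppRoot assumes the RM state store uses its default parent path, and rmr is the recursive delete command of the ZooKeeper 3.4 shell. Check the path with ls in zookeeper-client before deleting anything.

```python
import re

# YARN application IDs look like application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def build_cleanup_commands(app_id, auth, app_root="/rmstore/ZKRMStateRoot/RMAppRoot"):
    """Return the zookeeper-client shell lines that remove one stored application.

    app_root is an assumption (default RM state store layout); adjust it if
    your cluster uses a different parent path.
    """
    if not APP_ID_PATTERN.fullmatch(app_id):
        # Refuse anything that is not a single application znode name,
        # so a typo cannot turn into a delete of a parent path.
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return [
        f"addauth digest {auth}",
        f"rmr {app_root}/{app_id}",
    ]


# Hypothetical application ID; the auth string is the placeholder from the post.
for line in build_cleanup_commands(
    "application_1403888851749_0001",
    "yarn:yourReallyLongAuthStringHere",
):
    print(line)
```

Validating the application ID before building the rmr line is a deliberate choice, since a recursive delete on the wrong path would wipe the stored state for every application.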
04-07-2015 05:08 PM
The trigger for this behavior (depending on release level) is running YARN jobs before Kerberos was enabled: the RM will then not start once Kerberos is enabled.
The cleanup described by Brian is the workaround for this behavior.
04-13-2015 10:10 AM
Is this going to be patched? I'd think a failure to restart an application shouldn't prevent the entire system from recovering, correct? It would be nice if this were a recoverable error, or at minimum if the behavior could be overridden in the admin to help clean out broken nodes.