Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

ResourceManager HA failed to failover

Highlighted

ResourceManager HA failed to failover

Explorer

2017-05-30 15:34:42,826 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:193)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationAttemptStateDataProto.parseFrom(YarnServerResourceManagerRecoveryProtos.java:2470)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadApplicationAttemptState(ZKRMStateStore.java:608)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadRMAppState(ZKRMStateStore.java:591)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadState(ZKRMStateStore.java:470)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:592)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1021)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1062)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1058)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1058)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:302)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:122)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:417)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2017-05-30 15:34:42,827 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException
java.lang.NullPointerException
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:193)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationAttemptStateDataProto.parseFrom(YarnServerResourceManagerRecoveryProtos.java:2470)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadApplicationAttemptState(ZKRMStateStore.java:608)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadRMAppState(ZKRMStateStore.java:591)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadState(ZKRMStateStore.java:470)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:592)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1021)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1062)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1058)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1058)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:302)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:122)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:417)

 

 

 

 

Finally, only to delete the data on Zookeeper.

  

Are these configuration not in Cloudera cluster cos the HA failed ?

 

BTW I went through the codes in yarnconfiguration

 

 

 

 ResourceManager

protected void serviceStart() throws Exception {
RMStateStore rmStore = rmContext.getStateStore();
// The state store needs to start irrespective of recoveryEnabled as apps
// need events to move to further states.
rmStore.start();

if(recoveryEnabled) {
try {
LOG.info("Recovery started");
rmStore.checkVersion();
if (rmContext.isWorkPreservingRecoveryEnabled()) {
rmContext.setEpoch(rmStore.getAndIncrementEpoch());
}
RMState state = rmStore.loadState();
recover(state);
LOG.info("Recovery ended");
} catch (Exception e) {
// the Exception from loadState() needs to be handled for
// HA and we need to give up master status if we got fenced
LOG.error("Failed to load/recover state", e);
throw e;
}
}

  

  due to use Zookeeper ,so the concrete class is 

ZKRMStateStore
public synchronized RMState loadState() throws Exception {
RMState rmState = new RMState();
// recover DelegationTokenSecretManager
loadRMDTSecretManagerState(rmState);
// recover RM applications
loadRMAppState(rmState);
// recover AMRMTokenSecretManager
loadAMRMTokenSecretManagerState(rmState);
// recover reservation state
loadReservationSystemState(rmState);

return rmState;
}

  

 

updated-----------------------updated-----------------------updated-----------------------

 

 

The failure point is here

 

public synchronized RMState loadState() throws Exception {
RMState rmState = new RMState();
// recover DelegationTokenSecretManager
loadRMDTSecretManagerState(rmState);
// recover RM applications
loadRMAppState(rmState);
// recover AMRMTokenSecretManager
loadAMRMTokenSecretManagerState(rmState);
// recover reservation state
loadReservationSystemState(rmState);

return rmState;
}

 

 

 

 

 

 

 

Someone who knows the reason please leave notes  here 

 

Thank you cloudera and the employees