
ResourceManager HA failed to failover


2017-05-30 15:34:42,826 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:193)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationAttemptStateDataProto.parseFrom(YarnServerResourceManagerRecoveryProtos.java:2470)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadApplicationAttemptState(ZKRMStateStore.java:608)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadRMAppState(ZKRMStateStore.java:591)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadState(ZKRMStateStore.java:470)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:592)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1021)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1062)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1058)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1058)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:302)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:122)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:417)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2017-05-30 15:34:42,827 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException
java.lang.NullPointerException
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:193)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationAttemptStateDataProto.parseFrom(YarnServerResourceManagerRecoveryProtos.java:2470)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadApplicationAttemptState(ZKRMStateStore.java:608)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadRMAppState(ZKRMStateStore.java:591)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadState(ZKRMStateStore.java:470)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:592)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1021)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1062)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1058)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1058)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:302)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:122)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:417)

 

 

 

 

In the end, the only way I could recover was to delete the RM state data in ZooKeeper.

  

Is some configuration missing in the Cloudera cluster, and is that why the HA failover failed?

 

BTW, I went through the code in YarnConfiguration and the classes that appear in the stack trace.

 

 

 

 ResourceManager

protected void serviceStart() throws Exception {
  RMStateStore rmStore = rmContext.getStateStore();
  // The state store needs to start irrespective of recoveryEnabled as apps
  // need events to move to further states.
  rmStore.start();

  if (recoveryEnabled) {
    try {
      LOG.info("Recovery started");
      rmStore.checkVersion();
      if (rmContext.isWorkPreservingRecoveryEnabled()) {
        rmContext.setEpoch(rmStore.getAndIncrementEpoch());
      }
      RMState state = rmStore.loadState();
      recover(state);
      LOG.info("Recovery ended");
    } catch (Exception e) {
      // the Exception from loadState() needs to be handled for
      // HA and we need to give up master status if we got fenced
      LOG.error("Failed to load/recover state", e);
      throw e;
    }
  }
  // (excerpt ends here; the rest of the method is not shown)
}

  

Because ZooKeeper is used as the state store, the concrete class is:

ZKRMStateStore
public synchronized RMState loadState() throws Exception {
  RMState rmState = new RMState();
  // recover DelegationTokenSecretManager
  loadRMDTSecretManagerState(rmState);
  // recover RM applications
  loadRMAppState(rmState);
  // recover AMRMTokenSecretManager
  loadAMRMTokenSecretManagerState(rmState);
  // recover reservation state
  loadReservationSystemState(rmState);

  return rmState;
}

  

 

Update:

 

 

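The failure point is here, in the call to loadRMAppState(rmState), which according to the stack trace goes on to call loadApplicationAttemptState: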

 

public synchronized RMState loadState() throws Exception {
  RMState rmState = new RMState();
  // recover DelegationTokenSecretManager
  loadRMDTSecretManagerState(rmState);
  // recover RM applications
  loadRMAppState(rmState);
  // recover AMRMTokenSecretManager
  loadAMRMTokenSecretManagerState(rmState);
  // recover reservation state
  loadReservationSystemState(rmState);

  return rmState;
}
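
One possible reading of the trace, offered as an assumption rather than a confirmed diagnosis: the NullPointerException is raised inside the protobuf parser that loadApplicationAttemptState calls, which would be consistent with an application-attempt znode whose stored data is null. The sketch below only simulates that situation with an in-memory map. The class name, the helper names and the paths are hypothetical and are not taken from Hadoop or ZooKeeper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullAttemptDataSketch {

    /**
     * Hypothetical stand-in for a parser that reads a byte array.
     * Like any code that dereferences its argument, it throws a
     * NullPointerException when data is null.
     */
    public static int parseLength(byte[] data) {
        return data.length;
    }

    /**
     * Returns the keys whose payload is null or empty, i.e. the
     * entries that a parser like the one above could not load.
     */
    public static List<String> findUnreadable(Map<String, byte[]> store) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : store.entrySet()) {
            byte[] data = entry.getValue();
            if (data == null || data.length == 0) {
                bad.add(entry.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the state store; the paths are made up.
        Map<String, byte[]> store = new LinkedHashMap<>();
        store.put("/example/app_1/attempt_1", new byte[] {1, 2, 3});
        store.put("/example/app_2/attempt_1", null);

        // Without a guard, the null payload aborts the whole load.
        try {
            for (byte[] data : store.values()) {
                parseLength(data);
            }
        } catch (NullPointerException e) {
            System.out.println("Load aborted by a null payload");
        }

        // With a guard, the offending entries can be listed instead.
        System.out.println("Unreadable entries: " + findUnreadable(store));
    }
}
```

If this reading is right, it would also explain why deleting the stored state made the failover work again: the entry with the missing payload was removed along with everything else. Listing the affected entries first, as findUnreadable does in the simulation, would in principle allow removing only those.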

 

 

 

 

 

 

 

If anyone knows the reason, please leave a note here.

 

Thank you, Cloudera and its employees.

 
