Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

RM Crashed - STATE_STORE_OP_FAILED.

Expert Contributor

Today the RM crashed and threw the following error message. There are a number of JIRA tickets related to this error. One of my jobs was killed, but its application is still running in an orphaned state, and the app ID is still displayed in the RM UI.

I am unable to kill that application using yarn application -kill <app_id>. I restarted the RM and ZK, but the application still appears in the RM UI. It is not consuming any resources. How do I remove it from the display?

t: maxCompletedAppsInMemory = 10000, removing app application_1452798563961_0971 from memory:
2016-05-04 19:00:30,449 INFO  capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1193)) - Null container completed...
2016-05-04 19:00:30,568 INFO  capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1193)) - Null container completed...
2016-05-04 19:00:31,251 INFO  capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1193)) - Null container completed...
2016-05-04 19:00:32,252 INFO  capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1193)) - Null container completed...
2016-05-04 19:00:45,325 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(753)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
java.io.IOException: Wait for ZKClient creation timed out
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1073)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1097)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:934)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeRMDelegationTokenAndSequenceNumberState(ZKRMStateStore.java:734)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeRMDelegationTokenAndSequenceNumber(RMStateStore.java:650)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.storeNewToken(RMDelegationTokenSecretManager.java:112)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.storeNewToken(RMDelegationTokenSecretManager.java:49)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.storeToken(AbstractDelegationTokenSecretManager.java:272)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.createPassword(AbstractDelegationTokenSecretManager.java:391)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.createPassword(AbstractDelegationTokenSecretManager.java:47)
        at org.apache.hadoop.security.token.Token.<init>(Token.java:59)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:907)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:291)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
				
1 ACCEPTED SOLUTION

Super Guru
@Anandha L Ranganathan

Can you try killing it through the RM REST API? A sample command is below:

curl -v -X PUT -H 'Content-Type: application/json' -d '{"state": "KILLED"}' 'http://localhost:8088/ws/v1/cluster/apps/application_xxxxxxxx_xxxx/state'
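
If this needs to be scripted, a minimal shell sketch is below. It assumes the RM web service is reachable at http://localhost:8088 without authentication; the names RM_URL, app_state_url, get_app_state, and kill_app are illustrative and not part of YARN. It sends the request to the application's /state resource, which is where the RM's Cluster Application State API accepts the PUT.

```shell
#!/usr/bin/env bash
# Sketch only: RM_URL and the function names are assumptions for illustration.
RM_URL="${RM_URL:-http://localhost:8088}"

# Build the state endpoint URL for a given application id.
app_state_url() {
  printf '%s/ws/v1/cluster/apps/%s/state' "$RM_URL" "$1"
}

# Read the state the RM currently reports for the application.
get_app_state() {
  curl -s "$(app_state_url "$1")"
}

# Ask the RM to move the application to KILLED.
kill_app() {
  curl -s -X PUT \
    -H 'Content-Type: application/json' \
    -d '{"state": "KILLED"}' \
    "$(app_state_url "$1")"
}
```

After sourcing the file, kill_app application_xxxxxxxx_xxxx sends the request and get_app_state application_xxxxxxxx_xxxx shows what the RM reports afterwards.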

2 REPLIES

@Anandha L Ranganathan

I recommend contacting Hortonworks support for such cases.