Support Questions

sumit.nigam · ‎07-28-2015

I notice in the RM logs that my application sometimes transitions back from the RUNNING to the ACCEPTED state. Under what conditions would this happen? I thought this usually happens when the RM or AM dies and the applications are recovered. Such apps would transition from RUNNING --> ACCEPTED. Is that correct?

However, in my case both RM and NM recovery are disabled:

yarn.resourcemanager.recovery.enabled = false

yarn.nodemanager.recovery.enabled = false

Thanks,

Sumit

Harsh J · ‎07-28-2015

Recovery features deal with restarts of the service (RM or NM). An AM
attempt is a separate feature that, like container retries in MR, is a
regular runtime feature.

Do you see your application ID attempt multiple AMs in the RM UI page for
it? Do the RM logs indicate any form of kill or fail for the first
'appattempt' of the AM ID?
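
One way to answer the log question is to pull every "State change" line for the application and its attempts out of the RM log. The following is only a minimal sketch: the regular expression is modeled on the RM log lines quoted in this thread, and the function name `state_changes` is made up for illustration, not part of any Hadoop tool.

```python
import re

# Modeled on RM log lines of the form:
#   <timestamp> INFO <class>: appattempt_..._000001 State change from LAUNCHED to RUNNING
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"(?P<id>(?:appattempt|application)_\d+_\d+(?:_\d+)?) "
    r"State change from (?P<old>\w+) to (?P<new>\w+)"
)


def state_changes(lines, app_id):
    """Yield (timestamp, id, old_state, new_state) for one app and its attempts."""
    # "application_X_Y" owns attempts named "appattempt_X_Y_NNNNNN".
    attempt_prefix = "appattempt_" + app_id.split("_", 1)[1] + "_"
    for line in lines:
        m = STATE_CHANGE.search(line)
        if not m:
            continue
        ident = m.group("id")
        if ident == app_id or ident.startswith(attempt_prefix):
            yield m.group("ts"), ident, m.group("old"), m.group("new")


# Example input: two lines in the format the RM writes, plus one unrelated line.
sample = [
    "2015-07-24 14:20:40,980 INFO org.apache.hadoop.yarn.server.resourcemanager."
    "rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 "
    "State change from LAUNCHED to RUNNING",
    "2015-07-24 14:20:40,981 INFO org.apache.hadoop.yarn.server.resourcemanager."
    "rmapp.RMAppImpl: application_1437726395811_0010 "
    "State change from ACCEPTED to RUNNING",
    "2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager."
    "security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens",
]

for row in state_changes(sample, "application_1437726395811_0010"):
    print(*row)
```

An attempt that ends in FAILED or KILLED, followed by the application moving from RUNNING to ACCEPTED, points to an AM retry rather than an RM or NM recovery.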

sumit.nigam · ‎07-28-2015

Hi Harsh - You are right, there is a prior attempt which got killed. Here are some log snippets as you asked:

Attempt 1 - app becomes RUNNING

2015-07-24 14:20:40,980 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from LAUNCHED to RUNNING
2015-07-24 14:20:40,981 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1437726395811_0010 State change from ACCEPTED to RUNNING

Some hours later the master keys are rolled (activation in 900000 ms):
2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens
2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Going to activate master-key with key-id 1834122077 in 900000ms
2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens

2015-07-25 13:56:35,842 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Going to activate master-key with key-id 516071750 in 900000ms

The following 2 log lines keep repeating for the next 900000ms filling up logs:

2015-07-25 13:56:35,920 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1437726395811_0010_000001
2015-07-25 13:56:35,920 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1437726395811_0010_000001

...

2015-07-25 14:11:35,772 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1437726395811_0010_000001
2015-07-25 14:11:35,772 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1437726395811_0010_000001

This fails (not sure why) and leads to app termination
2015-07-25 14:11:35,877 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8030: readAndProcess from client 10.65.144.85 threw exception [org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1437726395811_0010_000001]
2015-07-25 14:11:36,888 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1437726395811_0010_01_000001 Container Transitioned from RUNNING to COMPLETED

1st attempt done (RUNNING --> ACCEPTED)
2015-07-25 14:11:36,888 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from RUNNING to FINAL_SAVING
2015-07-25 14:11:36,889 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from FINAL_SAVING to FAILED

2015-07-25 14:11:36,890 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1437726395811_0010 State change from RUNNING to ACCEPTED

2nd attempt starts
2015-07-25 14:11:36,890 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1437726395811_0010_000002 to scheduler from user: root

2015-07-25 14:11:36,891 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000002 State change from SUBMITTED to SCHEDULED

Not sure what this Null container indicates:

2015-07-25 14:11:36,910 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...

Harsh J · ‎07-28-2015

Thanks, that'd explain your transition. What application is this? Is it an
MR2 application, Spark app, or something custom?

sumit.nigam · ‎07-28-2015

HBase on YARN.

On a side note:

1. Is there a reason for the security token to just fail like that after 15 minutes of trying, or do I have a setup problem? That seems to be the main reason attempt one was killed.

2. The last line about null container - I see it often. Is that a bug? And can that be ignored?

Thanks,

Sumit

sumit.nigam · ‎07-28-2015

On point 1, I think I am getting hit by

https://issues.apache.org/jira/browse/YARN-3103
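
The 15-minute figure can be checked against the timestamps in the log snippets posted above. This is just a rough arithmetic sketch using values copied from those lines (the key roll at 13:56:35,841, the 900000 ms activation delay, and the InvalidToken exception at 14:11:35,877). It shows that the timing lines up; it does not by itself prove the cause.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the RM log lines; %f accepts the 3-digit milliseconds.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(text):
    """Parse an RM log timestamp such as '2015-07-25 13:56:35,841'."""
    return datetime.strptime(text, FMT)


# Values copied from the log lines quoted earlier in the thread.
rolled = parse_ts("2015-07-25 13:56:35,841")           # "Rolling master-key ..."
activation_delay = timedelta(milliseconds=900000)      # "... in 900000ms"
invalid_token = parse_ts("2015-07-25 14:11:35,877")    # "Invalid AMRMToken ..."

expected_activation = rolled + activation_delay
gap = invalid_token - expected_activation

print("delay in minutes:", activation_delay.total_seconds() / 60)
print("expected activation:", expected_activation.strftime("%H:%M:%S.%f")[:-3])
print("gap to InvalidToken in ms:", gap / timedelta(milliseconds=1))
```

So the token error shows up within milliseconds of the moment the rolled master key was due to become active, which fits a problem tied to the key roll rather than a random failure.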

Wilfred · ‎07-29-2015

The null container log entry shown in the earlier message is a code issue that has been fixed in an upcoming release. We printed the wrong reference for the container, so it would always be null.

For the state changes: they are correct. After an application attempt fails, and as long as the AM retries have not been exhausted, the application is pushed back into the queue for scheduling, which means the app goes back into the ACCEPTED state.
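
As a rough illustration of that rule, here is a deliberately simplified model in Python. It is not the RM's actual state machine (the real one has intermediate states such as FINAL_SAVING, as the logs above show); the `App` class and `on_attempt_failed` function are made-up names, and `max_attempts` stands in for the AM retry limit, which on a YARN cluster is normally capped by `yarn.resourcemanager.am.max-attempts`.

```python
from dataclasses import dataclass


@dataclass
class App:
    """Toy stand-in for an application tracked by the RM (hypothetical)."""
    max_attempts: int        # assumed AM retry limit
    attempts: int = 1        # the first AM attempt is already running
    state: str = "RUNNING"


def on_attempt_failed(app: App) -> App:
    """Simplified view of what happens to the app when its current AM attempt fails."""
    if app.attempts < app.max_attempts:
        # Retries left: the app is queued again and waits for a new AM container.
        app.attempts += 1
        app.state = "ACCEPTED"
    else:
        # No retries left: the whole application fails.
        app.state = "FAILED"
    return app


app = App(max_attempts=2)
on_attempt_failed(app)
print(app.state, app.attempts)   # first failure: requeued with a second attempt
on_attempt_failed(app)
print(app.state, app.attempts)   # second failure: retries exhausted
```

In the logs from this thread, the first step corresponds to appattempt ..._000001 failing and appattempt ..._000002 being added to the scheduler.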

For YARN-3103: you will not see that issue if you are running CDH 5.4, since the fix is part of that release. If you run an earlier version, please upgrade.

Wilfred

evinhas · ‎02-20-2019

When the first attempt fails, YARN tries to run the app again, so the status changes from "running" to "accepted". If you check the RM web UI, you can see that several attempts were run.

Cloudera Community

What does state transition RUNNING --> ACCEPTED mean?