07-28-2015 12:52 AM
I notice in the RM logs that my application sometimes transitions back from RUNNING to ACCEPTED state. Under what conditions would this happen? I thought this usually happens when the RM or AM dies and the applications are recovered, in which case such apps would transition from RUNNING --> ACCEPTED. Is that correct?
However, in my case both RM and NM recovery are disabled:
yarn.resourcemanager.recovery.enabled = false
yarn.nodemanager.recovery.enabled = false
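A quick way to confirm what a cluster is actually configured with is to read the two properties straight out of yarn-site.xml. The sketch below assumes the usual Hadoop configuration layout (a `configuration` root holding `property` elements with `name` and `value` children); the embedded sample document and the helper name `read_properties` are illustrative only, not taken from a real cluster.

```python
import xml.etree.ElementTree as ET

# Hypothetical yarn-site.xml content, embedded so the sketch is self-contained.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>false</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


props = read_properties(SAMPLE_YARN_SITE)
# Keep only the recovery switches discussed in this post.
recovery = {k: v for k, v in props.items() if k.endswith(".recovery.enabled")}
print(recovery)
```

To check a real file, the same function could be fed the contents of the yarn-site.xml that the ResourceManager actually loads.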
07-28-2015 01:02 AM
07-28-2015 03:55 AM
Hi Harsh - You are right, there is a prior attempt which got killed. Here are some log snippets as you asked:
Attempt 1 - app becomes RUNNING
2015-07-24 14:20:40,980 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from LAUNCHED to RUNNING
2015-07-24 14:20:40,981 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1437726395811_0010 State change from ACCEPTED to RUNNING
Some hours later the tokens are renewed (master keys rolled, set to activate in 900000 ms):
2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens
2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Going to activate master-key with key-id 1834122077 in 900000ms
2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens
2015-07-25 13:56:35,842 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Going to activate master-key with key-id 516071750 in 900000ms
The following two log lines keep repeating for the next 900000 ms, filling up the logs:
2015-07-25 13:56:35,920 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1437726395811_0010_000001
2015-07-25 13:56:35,920 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1437726395811_0010_000001
2015-07-25 14:11:35,772 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1437726395811_0010_000001
2015-07-25 14:11:35,772 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1437726395811_0010_000001
This fails (not sure why) and leads to app termination
2015-07-25 14:11:35,877 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8030: readAndProcess from client 10.65.144.85 threw exception [org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1437726395811_0010_000001]
2015-07-25 14:11:36,888 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1437726395811_0010_01_000001 Container Transitioned from RUNNING to COMPLETED
1st attempt done (RUNNING --> ACCEPTED)
2015-07-25 14:11:36,888 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from RUNNING to FINAL_SAVING
2015-07-25 14:11:36,889 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from FINAL_SAVING to FAILED
2015-07-25 14:11:36,890 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1437726395811_0010 State change from RUNNING to ACCEPTED
2nd attempt starts
2015-07-25 14:11:36,890 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1437726395811_0010_000002 to scheduler from user: root
2015-07-25 14:11:36,891 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000002 State change from SUBMITTED to SCHEDULED
Not sure what this Null container indicates:
2015-07-25 14:11:36,910 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
07-28-2015 06:08 AM
HBase on YARN.
On a side note:
1. Is there a reason for the security token to just fail like that after 15 minutes of trying, or do I have a setup problem? That seems to be the main reason the first attempt was killed.
2. The last line about the null container: I see it often. Is that a bug, and can it be ignored?
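The 15-minute figure comes from the 900000 ms activation delay in the log. As a rough check, the sketch below adds that delay to the time of the key roll (13:56:35,841) and compares the result with the time of the InvalidToken exception (14:11:35,877). The timestamps are copied from the snippets above; the variable names are illustrative. It only shows that the two events coincide in time, which is suggestive but does not by itself prove the cause.

```python
from datetime import datetime, timedelta

# Timestamp format used by the RM log lines quoted in this thread.
FMT = "%Y-%m-%d %H:%M:%S,%f"

rolled_at = datetime.strptime("2015-07-25 13:56:35,841", FMT)
invalid_token_at = datetime.strptime("2015-07-25 14:11:35,877", FMT)

# "Going to activate master-key ... in 900000ms"
activation_delay = timedelta(milliseconds=900000)

activates_at = rolled_at + activation_delay
gap = invalid_token_at - activates_at

print("delay in minutes:", activation_delay.total_seconds() / 60)
print("new key activates at:", activates_at.strftime(FMT))
print("ms between activation and InvalidToken:", gap.total_seconds() * 1000)
```

The exception is logged within a few tens of milliseconds of the scheduled activation, so the failure looks tied to the key rollover rather than to 15 minutes of repeated attempts.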
07-29-2015 07:49 PM
The null container log entry shown in the earlier message is a code issue that has been fixed in an upcoming release. We printed the wrong reference for the container, so it would always be null.
The state changes are correct: when an application attempt fails and the AM retries have not been exhausted, the application is pushed back into the queue for scheduling, which means it goes into the ACCEPTED state.
As for YARN-3103: you will not see that issue if you are running CDH 5.4, because the fix is part of that release. If you run an earlier version, please upgrade.
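The retry rule described here can be written down as a tiny model. This is a deliberately simplified sketch, not the ResourceManager's actual code: the function name is hypothetical, `max_attempts` stands in for the AM retry limit (which, as far as I know, is controlled by `yarn.resourcemanager.am.max-attempts`), and it ignores details such as failures that YARN may not count against the limit.

```python
def app_state_after_attempt_failure(failed_attempts, max_attempts):
    """Simplified model of the application state after an attempt fails.

    failed_attempts: number of attempts that have failed so far (>= 1).
    max_attempts:    assumed AM retry limit for the application.
    """
    if failed_attempts < 1 or max_attempts < 1:
        raise ValueError("both arguments must be at least 1")
    if failed_attempts < max_attempts:
        # Retries remain: the app is queued again and waits for a new AM container.
        return "ACCEPTED"
    # Retries exhausted: the application as a whole fails.
    return "FAILED"


# First attempt fails with a limit of 2: the app goes RUNNING -> ACCEPTED.
print(app_state_after_attempt_failure(1, 2))
# Second attempt also fails: no retries are left.
print(app_state_after_attempt_failure(2, 2))
```

With a limit of 2, the first failure matches the RUNNING to ACCEPTED transition seen in the logs above.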
02-20-2019 06:26 AM
When the first attempt fails, YARN tries to run the app again, so the status changes from "running" to "accepted". If you check the RM web UI, you can see that several attempts were run.
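The attempts can also be recovered from the RM log itself. The sketch below is one way to do it, assuming log lines shaped like the ones quoted earlier in this thread ("... State change from X to Y"); the regular expression and the helper name `state_changes` are my own, and the sample lines are copied from the snippets above.

```python
import re

# Matches RM lines of the form:
#   <timestamp> INFO <class>: <application_... | appattempt_...> State change from X to Y
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO \S+: "
    r"(?P<entity>(?:application|appattempt)_\S+) State change from "
    r"(?P<old>\w+) to (?P<new>\w+)"
)

PKG = "org.apache.hadoop.yarn.server.resourcemanager"
SAMPLE_LOG = [
    f"2015-07-24 14:20:40,981 INFO {PKG}.rmapp.RMAppImpl: application_1437726395811_0010 State change from ACCEPTED to RUNNING",
    f"2015-07-25 14:11:36,889 INFO {PKG}.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from FINAL_SAVING to FAILED",
    f"2015-07-25 14:11:36,890 INFO {PKG}.rmapp.RMAppImpl: application_1437726395811_0010 State change from RUNNING to ACCEPTED",
    f"2015-07-25 14:11:36,890 INFO {PKG}.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1437726395811_0010_000002 to scheduler from user: root",
    f"2015-07-25 14:11:36,891 INFO {PKG}.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000002 State change from SUBMITTED to SCHEDULED",
]


def state_changes(lines):
    """Yield (timestamp, entity, old_state, new_state) for each state-change line."""
    for line in lines:
        m = STATE_CHANGE.match(line)
        if m:
            yield (m["ts"], m["entity"], m["old"], m["new"])


changes = list(state_changes(SAMPLE_LOG))
# Distinct attempt IDs that appear in state-change lines.
attempts = sorted({e for _, e, _, _ in changes if e.startswith("appattempt_")})

for ts, entity, old, new in changes:
    print(ts, entity, old, "->", new)
print("attempts seen:", attempts)
```

On a real log, `SAMPLE_LOG` would be replaced by the lines of the ResourceManager log file; each RUNNING to ACCEPTED transition of the application should sit next to a failed attempt and the start of the next one.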