07-29-2015 08:41 PM
Currently, there are multiple reasons for an app to get killed, some of which are:
1. Natural termination because the app has finished.
2. A container got killed because a limit was exhausted, such as the physical memory limit, virtual memory limit, etc.
3. The AM got killed more than max-attempts times.
4. Queue-level limits being reached.
5. The overall maximum allowed container memory being reached for a given NodeManager.
7. Security tokens invalid.
8. Admin killed it through UI.
9. Application recovery not being supported when the RM or NM restarts.
These show up in the logs, but is there a way to determine them programmatically, say through a ResourceManager REST API call? Perhaps its diagnostics field could be populated with the exact reason when YARN is aware of it?
07-29-2015 11:38 PM
Most of the reasons you have given do not cause an application to terminate (pre-emption, queue limits, etc.).
YARN does not know what the application does internally and what the application considers a failure.
A failure for a MapReduce application is different from a failure for a Spark application. YARN can only tell you the final state based on what the ApplicationMaster passes back. That state is reflected in the YARN application state and can be retrieved.
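As a rough sketch of retrieving that state programmatically: the ResourceManager REST API exposes a per-application resource at `/ws/v1/cluster/apps/{appid}` whose `app` object carries `state`, `finalStatus` and `diagnostics`. The helper names below (`app_url`, `summarize_app`, `fetch_app_summary`), the host and port, and the sample payload are all illustrative assumptions, not something from this thread, and an unsecured HTTP endpoint is assumed.

```python
import json
import urllib.request


def app_url(rm_address, app_id):
    """Build the ResourceManager REST URL for a single application.

    rm_address is "host:port" of the RM web interface (placeholder value).
    """
    return f"http://{rm_address}/ws/v1/cluster/apps/{app_id}"


def summarize_app(payload):
    """Pick out the fields that describe how the application ended."""
    app = payload["app"]
    return {
        "state": app.get("state"),
        "finalStatus": app.get("finalStatus"),
        "diagnostics": app.get("diagnostics", ""),
    }


def fetch_app_summary(rm_address, app_id, timeout=10):
    """Query a live ResourceManager (needs network access to the RM)."""
    request = urllib.request.Request(
        app_url(rm_address, app_id),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize_app(json.load(response))


# Offline example with a hand-written payload (all values are made up).
sample = {
    "app": {
        "id": "application_0000000000000_0001",
        "state": "FINISHED",
        "finalStatus": "FAILED",
        "diagnostics": "hypothetical diagnostics text",
    }
}
summary = summarize_app(sample)
print(summary["state"], summary["finalStatus"])
```

What ends up in `diagnostics` depends on what YARN and the ApplicationMaster report, so it may be empty or only loosely describe the cause.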
What are you trying to achieve with getting the "exact" reason?
07-30-2015 02:26 AM - edited 07-30-2015 02:27 AM
OK, I probably did not know the full details. But doesn't YARN pre-empt containers? If so, what does pre-emption mean for the app?
The idea behind knowing the reason programmatically is to help debug faster and to decide how to prevent the failure next time. On bigger clusters, the exact reason for a termination is often difficult to determine and can take some time to find.
07-30-2015 05:34 AM
Pre-emption does not cause an attempt failure. If a container is pre-empted, the attempt is moved back to a state in which it can be scheduled again. The attempt is not marked as failed, so the application is not really affected by it, besides a longer run time.
Example: if a map task is running and gets pre-empted 10 times, it will still be able to start again. If a task fails its maximum number of attempts (default 4), it will not be rescheduled: the task fails and the application fails.
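The counting rule above can be illustrated with a toy model. This is not YARN or MapReduce code: `run_task`, the outcome strings and the return tuple are hypothetical, and the model assumes the limit is hit once the number of failures reaches `max_attempts`.

```python
def run_task(outcomes, max_attempts=4):
    """Toy model: replay a list of launch outcomes for one task.

    Each outcome is "success", "failed" or "preempted". Only "failed"
    counts against max_attempts; a pre-empted launch is simply retried.
    Returns (final_state, launches, failures).
    """
    failures = 0
    launches = 0
    for outcome in outcomes:
        launches += 1
        if outcome == "success":
            return ("SUCCEEDED", launches, failures)
        if outcome == "failed":
            failures += 1
            if failures >= max_attempts:
                # No more retries: the task, and with it the app, fails.
                return ("FAILED", launches, failures)
        # "preempted": back to a schedulable state, nothing is counted.
    return ("PENDING", launches, failures)


# Ten pre-emptions followed by a successful run: no failures recorded.
preempted_then_ok = run_task(["preempted"] * 10 + ["success"])

# Four real failures exhaust the default limit.
four_failures = run_task(["failed"] * 4)

print(preempted_then_ok)
print(four_failures)
```

In this model the pre-empted task only pays in extra launches (longer run time), while the failing task stops being retried once the limit is reached.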