
Getting exact reason for app termination through a program.

Contributor

Currently, an app can get killed for multiple reasons, some of which are:

 

1. Natural termination as the app is over.

2. A container got killed because a limit was exhausted, such as the physical memory limit or the virtual memory limit.

3. The AM got killed more than max-attempts times.

4. Queue level limits being reached.

5. The overall maximum container memory allowed on a given NodeManager being reached.

6. Pre-emption.

7. Security tokens invalid.

8. Admin killed it through UI.

9. App recovery not being supported when the RM or NM restarts.

10. Etc.

 

These show up in the logs, but is there a way to determine them programmatically, say through a ResourceManager REST API call? Could the diagnostics field in the response be populated with the exact reason when YARN is aware of it?


Re: Getting exact reason for app termination through a program.

Super Collaborator

Most of the reasons you have given (pre-emption, queue limits, etc.) do not cause an application to terminate.

YARN does not know what the application does internally and what the application considers a failure.

What counts as a failure for a MapReduce application is different from what counts as a failure for a Spark application. YARN can only tell you the final state based on what the ApplicationMaster passes back. That state is reflected in the YARN application state and can be retrieved.
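As a rough sketch of retrieving that state programmatically: the ResourceManager REST API exposes an application resource at /ws/v1/cluster/apps/{appid} whose JSON includes the state, finalStatus, and diagnostics fields. The helper names below, the host rm.example.com:8088, and the application ID are placeholders for illustration, and the sketch assumes an unsecured RM web endpoint reachable over plain HTTP.

```python
import json
from urllib.request import Request, urlopen


def app_url(rm_address, app_id):
    """Build the RM REST URL for one application (rm_address is host:port)."""
    return f"http://{rm_address}/ws/v1/cluster/apps/{app_id}"


def summarize_app(payload):
    """Pick the outcome-related fields out of the parsed JSON response."""
    app = payload["app"]
    return {
        "state": app.get("state"),              # e.g. FINISHED, FAILED, KILLED
        "finalStatus": app.get("finalStatus"),  # what the AM reported back
        "diagnostics": app.get("diagnostics", ""),
    }


def fetch_app_summary(rm_address, app_id, timeout=10):
    """Query the RM and return the summarized outcome (needs a live cluster)."""
    request = Request(app_url(rm_address, app_id),
                      headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return summarize_app(json.load(response))
```

Against a live cluster this would be called as fetch_app_summary("rm.example.com:8088", "application_1_0001"). The diagnostics text is free-form, so it is only as precise as what YARN or the ApplicationMaster chose to record.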

 

What are you trying to achieve with getting the "exact" reason?

 

Wilfred

Re: Getting exact reason for app termination through a program.

Contributor

OK, I probably did not know the full details. But doesn't YARN pre-empt containers? If so, what does pre-emption mean for the app?

 

The idea behind knowing it programmatically is to help debug faster and to decide how to prevent the same termination next time. On bigger clusters, the exact reason behind a termination can often be difficult to determine and can take some time to find.

Re: Getting exact reason for app termination through a program.

Super Collaborator

Pre-emption does not cause an attempt failure. If a container is pre-empted, the attempt is moved back to a state in which it can be scheduled again. The attempt is not marked as failed, so the application is not really affected by it, apart from a longer run time.

Example: if a map task is running and gets pre-empted 10 times, it will still be able to start again. If the task instead uses up its maximum number of attempts (default 4) through real failures, it will not be rescheduled; the task will fail and the app will fail.
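The accounting in that example can be illustrated with a toy model. This is not YARN's or MapReduce's actual code: task_outcome is a hypothetical function, and it assumes the task fails once its real failures reach max_attempts. The value -102 is the ContainerExitStatus.PREEMPTED constant from the YARN API.

```python
PREEMPTED = -102  # ContainerExitStatus.PREEMPTED in the YARN API


def task_outcome(exit_statuses, max_attempts=4):
    """Toy model: walk the exit statuses of successive attempts of one task.

    Pre-empted attempts are skipped and do not count as failures.
    """
    failures = 0
    for status in exit_statuses:
        if status == 0:
            return "SUCCEEDED"
        if status == PREEMPTED:
            continue  # rescheduled, not counted against the task
        failures += 1
        if failures >= max_attempts:
            return "FAILED"  # no more rescheduling; the app fails too
    return "PENDING"  # still eligible to be scheduled again
```

For instance, task_outcome([PREEMPTED] * 10 + [0]) yields "SUCCEEDED", while task_outcome([1, 1, 1, 1]) yields "FAILED".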

 

Wilfred