I'm running a Spark streaming job in YARN to continuously process a stream of data.
If an executor dies, Spark recovers gracefully; however, if the driver (which is also the AM) dies, YARN finishes the application and considers it "SUCCESSFUL".
If Spark fails to start (e.g. missing classes), YARN does retry as many times as configured, but it does not retry if you kill the process once it is running.
That creates a single point of failure.
Is there any way to tell YARN to restart the AM if it dies?
The only option I see is creating my own script to monitor the app and launch it again, which is far from ideal.
It does depend on how the driver dies. As you said, the AM is retried based on the settings under certain circumstances.
You seem to have stumbled onto a case which is not handled correctly. However, we would need to know a bit more: why did the driver fail? Do you have a log of the container that ran the driver, so we can see what the cause of the driver failure was?
Thanks for your answer.
I'm trying to simulate different failure scenarios and recovery.
What I'm doing is killing processes myself to simulate failures and see how the system handles it.
Digging around, it seems that making long-running YARN apps robust is still a work in progress: https://issues.apache.org/jira/browse/YARN-896
Part of that we have already included via configuration when you run on YARN. There are still known issues on the Spark side which make recovery not straightforward and not robust in all failure cases. AM failures are notoriously hard to recover from, but I would expect a direct kill of the AM container to be seen as a failure and picked up by YARN for a restart.
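For reference, the retry configuration being referred to is, as far as I know, controlled by the properties below (assuming a Spark version that supports the per-application override; `your-app.jar` is a placeholder):

```shell
# Cluster-wide cap on AM attempts, set in yarn-site.xml (default is 2):
#   yarn.resourcemanager.am.max-attempts
#
# Per-application override on the Spark side, passed at submit time
# (cannot exceed the cluster-wide cap above):
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  your-app.jar
```

Note that these attempts only kick in when YARN actually sees the AM exit as a failure, which is exactly the discrepancy discussed here.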
How do you kill the container (kill -9 of the JVM?)
Yes, kill -9 on the AM process ends the job as SUCCESS.
I'm using CDH 5.4.2
A kill -9 is a SIGKILL. A SIGKILL is not catchable, as described in the man page (man 7 signal), which means that we cannot catch it in the shutdown hook and the default exit code is used. Can you try using the SIGTERM signal and see if that works?
SIGKILL is simply not something that can be handled.
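The difference in catchability is easy to demonstrate. The Spark AM's cleanup runs in a JVM shutdown hook, but the underlying signal semantics are the same in any language; a minimal sketch in Python:

```python
import os
import signal

caught = []

def handler(signum, frame):
    # Runs when a catchable signal is delivered: a chance to clean up.
    caught.append(signum)

# SIGTERM can be handled, so a shutdown hook gets to run.
signal.signal(signal.SIGTERM, handler)
os.kill(os.getpid(), signal.SIGTERM)
print(caught)  # handler saw the SIGTERM

# SIGKILL can never be handled; even installing a handler fails.
try:
    signal.signal(signal.SIGKILL, handler)
except OSError:
    print("SIGKILL cannot be caught")
```

With SIGKILL the kernel terminates the process outright, so no hook of any kind runs and whatever exit reporting the process would have done is skipped.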
I tried the same with just kill (no -9) and there was no difference, still SUCCESS.
Either way, I expected YARN to notice that it had lost contact with the AM without a proper shutdown, rather than assume it finished OK.
Unfortunately that is not how YARN detects that something fails.
The failure detection is based on the return value of the application. The return value is sent to the RM, which then changes the state of the container based on that value, and the container gets rescheduled accordingly. Since the kill sets the result to SUCCESS, YARN can't do anything about rescheduling. There is a discrepancy here: a simple kill of the container should not return SUCCESS.
Can you check the YARN container logs for the Spark ApplicationMaster and look for messages around shutdown hooks?