
Fault tolerance in long-running YARN jobs (Spark Streaming)

Contributor

I'm running a Spark Streaming job on YARN to continuously process a stream of data.

If an executor dies, Spark recovers gracefully. However, if the driver (which is also the AM) dies, YARN finishes the application and considers it "SUCCESSFUL".

If Spark fails to start (e.g. because of missing classes), YARN does retry as many times as configured, but not if you kill the process once it is running.

That creates a single point of failure.

 

Is there any way to tell YARN to restart the AM if it dies? 

The only option I see is creating my own script to monitor the app and launch it again, which is far from ideal.
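
Something along these lines is what I had in mind for that watchdog, by the way (a rough sketch only; the application name, jar and polling interval are placeholders, not my real job):

# Hypothetical watchdog: resubmit the streaming job if it is no longer listed as running.
APP_NAME="my-streaming-app"
while true; do
  if ! yarn application -list 2>/dev/null | grep -q "$APP_NAME"; then
    spark-submit --master yarn-cluster --name "$APP_NAME" my-streaming-job.jar &
  fi
  sleep 60
done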

 

 

7 REPLIES

Re: Fault tolerance in long-running YARN jobs (Spark Streaming)

Super Collaborator

It does depend on how the driver dies. As you said, the AM is retried based on the configured settings, but only under certain circumstances.

You seem to have stumbled onto a case that is not handled correctly. However, we would need to know a bit more: why did the driver fail? Do you have a log of the container that ran the driver, so we can see what caused the failure?
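
If you still have the application id, the aggregated container logs (which include the container that ran the driver/AM) can usually be pulled with something like the following, assuming log aggregation is enabled on the cluster; the id here is a placeholder:

yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX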

 

Wilfred

Re: Fault tolerance in long-running YARN jobs (Spark Streaming)

Contributor

Thanks for your answer.

 

I'm trying to simulate different failure scenarios and recovery.

What I'm doing is killing processes myself to simulate failures and see how the system handles it.

 

Digging around, it looks like robust support for long-running YARN apps is still a work in progress: https://issues.apache.org/jira/browse/YARN-896

Re: Fault tolerance in long-running YARN jobs (Spark Streaming)

Super Collaborator

Part of that we have already included via configuration when you run on YARN. There are still known issues on the Spark side which make recovery not straightforward and not robust in all failure cases. AM failures are notoriously hard to recover from, but I would expect a direct kill of the AM container to be seen as a failure and picked up by YARN for a restart.
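
For reference, the knobs involved are roughly these (a sketch only; exact property names and defaults depend on the Spark and CDH release, so please verify against your version):

# yarn-site.xml: cluster-wide cap on AM attempts (default is 2)
yarn.resourcemanager.am.max-attempts
# per-application limit passed at submit time; it cannot exceed the YARN cap
spark-submit --conf spark.yarn.maxAppAttempts=2 ...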

How do you kill the container (kill -9 of the JVM)?

 

Wilfred

Re: Fault tolerance in long-running YARN jobs (Spark Streaming)

Contributor

Yes, kill -9 on the AM process ends the job as SUCCESS.

I'm using CDH 5.4.2.
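
To reproduce it, this is roughly what I run on the node hosting the AM (the grep pattern and the pid are illustrative):

# find the JVM running the Spark ApplicationMaster on this node
ps aux | grep ApplicationMaster
# kill it hard; the application is still reported as SUCCESS afterwards
kill -9 <pid>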

Re: Fault tolerance in long-running YARN jobs (Spark Streaming)

Super Collaborator

A kill -9 is a SIGKILL. A SIGKILL is not catchable, as described in the man page (man 7 signal), which means we cannot handle it in the shutdown hook that is running, so the default exit code is used. Can you try the SIGTERM signal instead and see if that works?

SIGKILL is just not something that can be handled.
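
To illustrate the difference: a JVM shutdown hook (which is the mechanism used to report the final status) runs on a normal exit and on SIGTERM, but never on SIGKILL. A minimal Scala sketch, not Spark's actual code:

object ShutdownHookDemo {
  def main(args: Array[String]): Unit = {
    // Runs on normal exit and on SIGTERM (kill <pid>), but never on SIGKILL (kill -9 <pid>)
    sys.addShutdownHook {
      println("shutdown hook ran; a final status could be reported here")
    }
    // Keep the process alive so it can be killed from another terminal
    Thread.sleep(Long.MaxValue)
  }
}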

 

Wilfred

Re: Fault tolerance in long-running YARN jobs (Spark Streaming)

Contributor

I tried the same with just kill (no -9) and there was no difference, still SUCCESS.

 

Either way, I expected YARN to notice that it has lost contact with the AM without a proper shutdown, rather than assume it finished OK.

Re: Fault tolerance in long-running YARN jobs (Spark Streaming)

Super Collaborator

Unfortunately, that is not how YARN detects that something has failed.

Failure detection is based on the return value of the application. The return value is sent to the RM, which changes the state of the container based on that value, and the container is rescheduled accordingly. Since the kill sets the result to SUCCESS, YARN can't do anything about rescheduling. There is some discrepancy here: a simple kill of the container should not return SUCCESS.
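
You can double-check what was reported by asking the RM for the application report, for example (the id is a placeholder and the exact field names can differ between versions):

yarn application -status application_XXXXXXXXXXXXX_XXXX
# look at the Final-State and diagnostics fields in the report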

 

Can you check the YARN container logs for the Spark ApplicationMaster and look for messages around shutdown hooks?

 

Wilfred
