I am facing a peculiar situation that I need some help with. I have a job that succeeds most of the time but has recently been giving me trouble. I am running Spark 2.1 on YARN from a Jupyter notebook on HDP 2.6.2.
I can see that the Spark session shuts down for some unknown reason and the job fails.
When I dig through the YARN logs, I can see that the Spark context was shut down because of executor failures, with the following error message:
18/04/04 11:20:41 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (10) reached)
I was curious why this started happening, and I also noticed that recently I have been seeing a lot of error messages about container initialization failures, as shown below.
WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e358_1521934670303_16356_01_000071 on host: x. Exit status: -1000. Diagnostics: Application application_1521934670303_16356 initialization failed (exitCode=255) with output: main : command provided 0 main : run as user is x main : requested yarn user is x
So I increased the executor failure threshold from the default to 100 and set the failure validity interval to 15 minutes, since this is a long-running notebook, and that worked. But what I would like to understand is what is causing these containers to fail in the first place.
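For reference, these are the two Spark-on-YARN properties involved; the values below match what is described above (100 failures, a 15-minute validity window), and can go in `spark-defaults.conf` or be passed as `--conf` flags to `spark-submit`:

```properties
# Fail the application only after 100 executor failures (default is
# 2 * num executors, minimum 3)
spark.yarn.max.executor.failures              100
# Only count executor failures from the last 15 minutes toward that limit
spark.yarn.executor.failuresValidityInterval  15m
```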
I couldn't find anything interesting in the YARN logs or the driver logs. The container logs don't exist at all because initialization failed outright. I am not sure whether this could be caused by preemption on the YARN queues.
Any help on understanding/debugging these logs would be incredibly useful.
You would have to start by looking into the executor failures. You said this job was working fine earlier and that you only recently started facing this issue. In that case the maximum number of executor failures was set to 10 and that was enough, but now the number of executor failures has started exceeding 10. Executor failures can also be caused by resource unavailability, so you may need to consider the cluster's resource/memory availability at the time of your job execution as well. Hope it helps!
Enough resources were available on the cluster, so I don't think that was the issue. I also have dynamic resource allocation configured, so Spark should not allocate more resources than are available and should scale up only when it needs more. I haven't changed the memory or CPUs requested per executor either, so that shouldn't be a problem.
It turns out that some of our node managers had an inconsistent Java version for some reason, and this caused issues with memory allocation during executor creation! We disabled those node managers for now and the issue disappeared. It was quite difficult to trace without a proper stack trace from YARN pointing to the cause.
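In case it helps anyone hit by the same thing: a quick way to spot this kind of inconsistency is to collect `java -version` output from every node manager host (e.g. over ssh; note that `java -version` prints to stderr) and group the hosts by reported version. The hostnames and sample outputs below are hypothetical; only the parsing/grouping logic is the point:

```python
import re
from collections import defaultdict

def parse_java_version(version_output):
    """Extract the version string from `java -version` output,
    e.g. 'java version "1.8.0_112"' -> '1.8.0_112'."""
    m = re.search(r'version "([^"]+)"', version_output)
    return m.group(1) if m else None

def group_hosts_by_java_version(outputs):
    """Group hosts by reported Java version.

    `outputs` maps hostname -> raw `java -version` output (however you
    collected it). Returns a dict of version -> list of hosts; more than
    one key means the cluster's Java installs are inconsistent."""
    by_version = defaultdict(list)
    for host, out in outputs.items():
        by_version[parse_java_version(out)].append(host)
    return dict(by_version)

# Hypothetical output collected from three node manager hosts
sample = {
    "nm1": 'java version "1.8.0_112"\nJava(TM) SE Runtime Environment',
    "nm2": 'java version "1.8.0_112"\nJava(TM) SE Runtime Environment',
    "nm3": 'java version "1.7.0_80"\nJava(TM) SE Runtime Environment',
}
groups = group_hosts_by_java_version(sample)
print(groups)  # two version groups -> nm3 is the odd one out
```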