I have a problem with my Spark app running in cluster mode. My app has two containers: one driver and one executor.
When the executor container is killed (by an error or anything else), my app neither requests another container nor kills the current attempt so that the job can restart. The job behaves like a zombie: the driver doesn't notice that the executor is gone and carries on without any error.
Maybe I missed something in the YARN configuration or in the spark-submit parameters.
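
To make the question concrete, here is a minimal sketch of the kind of spark-submit parameters I mean. It is not my actual command: the jar name, the main class, and every value are placeholders, and the helper `build_submit_command` is hypothetical. The three properties are, as far as I understand, the ones that relate to retries (application attempts, executor failures, task failures); whether they are the right ones for this situation is exactly what I am unsure about.

```python
import shlex

# Placeholder values, not taken from a real deployment.
# Note: spark.yarn.max.executor.failures is the name used in older Spark
# releases; newer releases may expect a different property name.
RETRY_CONF = {
    "spark.yarn.maxAppAttempts": "2",         # YARN application attempts
    "spark.yarn.max.executor.failures": "4",  # executor failures before the app fails
    "spark.task.maxFailures": "4",            # failures of one task before the job fails
}


def build_submit_command(app_jar, main_class, conf):
    """Assemble a spark-submit argument list for YARN cluster mode."""
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", "1",  # one executor, as in the setup described above
        "--class", main_class,
    ]
    # Sort the keys so the generated command is stable between runs.
    for key in sorted(conf):
        cmd += ["--conf", f"{key}={conf[key]}"]
    cmd.append(app_jar)
    return cmd


if __name__ == "__main__":
    # "com.example.MyApp" and "my-app.jar" are hypothetical names.
    command = build_submit_command("my-app.jar", "com.example.MyApp", RETRY_CONF)
    print(shlex.join(command))
```

The script only prints the command so it can be inspected before being run against a cluster.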