1) Currently, Spark does not migrate work to a different node (it may, but only by chance). There is work in progress to add node blacklisting, but that hasn't been committed yet (https://issues.apache.org/jira/browse/SPARK-8426)
2) Task failure - an exception encountered while running a task, e.g. user code throws an exception, or something external to the task fails, such as Spark being unable to read from HDFS, etc.
Job failure - if a particular task fails 4 times, then Spark gives up and cancels the whole job
Stage failure (this is the trickiest) - this happens when a task attempts to read *shuffle* data from another node. If it fails to read that shuffle data, it assumes the remote node is dead (the failure may be due to a bad disk, a network error, a bad node, the node being overloaded with other tasks and not responding fast enough, etc.). Spark then assumes it needs to regenerate the input data, so it marks the stage as failed and re-runs the previous stage that generated that input. If the stage retry fails 4 times, Spark gives up, assuming the cluster has an issue.
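As a sketch of the task-retry behavior above: the per-task retry limit (4 by default) is exposed as the `spark.task.maxFailures` setting, which can be raised at submit time. The application class and jar names below are hypothetical placeholders:

```shell
# Raise the number of task attempts before Spark cancels the job
# (default is 4). Class and jar names here are hypothetical.
spark-submit \
  --conf spark.task.maxFailures=8 \
  --class com.example.MyApp \
  myapp.jar
```

Note this only governs ordinary task failures; the stage-level retry count for shuffle-fetch failures is handled internally by the scheduler.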
3) No great answer to this one. The best answer is really just using "yarn logs -applicationId <appId>" to get all the logs in one file, so it's a bit easier to search through to find errors (rather than having to click through the logs one by one)
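A minimal example of that workflow, assuming log aggregation is enabled on the cluster (the application ID shown is a hypothetical placeholder):

```shell
# Pull all aggregated container logs for one application into a single file,
# then scan it for common failure keywords. The application ID is made up.
yarn logs -applicationId application_1449094262741_0001 > app.log
grep -in "error\|exception" app.log
```

Note that `yarn logs` only works after the application has finished (or, on newer YARN versions, while it is running), and requires yarn.log-aggregation-enable=true.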
4) No, you don't need any setting for that. Spark should be resilient to single-node failures. That said, there could be bugs in this area; if you find that is not the case, please provide the applicationId and cluster information so that I can collect the logs and pass them on to our Spark team to analyze.