My job just failed: one task failed after 4 attempts on a single node, and the rest of the tasks were killed. The node on which the task failed had just suffered a disk failure. My questions are:
1. My understanding is that when a disk fails, the NameNode excludes that disk from serving any data. So why did the task fail all 4 attempts on the same node, with only about a second between attempts?
2. Can a single failing task like this cause the whole job to fail?
3. How can I avoid this situation in the future? In particular, how can I make subsequent task attempts run on a different node?
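For context, I have been looking at the retry and node-blacklisting settings to see if tuning them would help. Below is a sketch of what I am considering in `mapred-site.xml` (property names assume Hadoop 2.x / MRv2; the default of 4 for `mapreduce.map.maxattempts` matches the 4 attempts I observed, and the values shown are guesses, not a tested configuration):

```xml
<!-- mapred-site.xml (sketch, Hadoop 2.x / MRv2 property names) -->

<property>
  <!-- Maximum attempts per map task before the task is declared failed
       (default 4, which matches what I saw) -->
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>
</property>

<property>
  <!-- After this many failures of any task on one node, the job stops
       scheduling attempts on that node (default 3). Lowering it should
       push retries onto a different node sooner. -->
  <name>mapreduce.job.maxtaskfailures.per.tracker</name>
  <value>2</value>
</property>

<property>
  <!-- Percentage of map tasks allowed to fail without failing the whole
       job (default 0, i.e. one failed task fails the job) -->
  <name>mapreduce.map.failures.maxpercent</name>
  <value>0</value>
</property>
```

Is this the right set of knobs, or is there a better way to handle a node with a failed disk?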