I have replication set up to copy the output of a daily process to another cluster. It uses:
hadoop distcp -update -delete $SOURCE $TARGET
However, it occasionally fails (after the map phase reaches 100%!) with this error:
19/01/10 16:13:44 INFO mapreduce.Job: map 100% reduce 0%
19/01/10 16:15:51 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=FAILED. Redirecting to job history server
19/01/10 16:15:51 INFO mapreduce.Job: map 0% reduce NaN%
19/01/10 16:15:51 INFO mapreduce.Job: Job job_1546553376389_1538 failed with state FAILED due to:
19/01/10 16:15:51 ERROR tools.DistCp: Exception encountered
java.io.IOException: DistCp failure: Job job_1546553376389_1538 has failed:
    at org.apache.hadoop.tools.DistCp.execute(DistCp.java:195)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:493)
I'm looking for some advice on how to investigate this problem, as I'm not completely sure where to start. Has anyone encountered something similar? What logs might have useful information for failed tasks like this?
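Without more context, the best starting point is the YARN ResourceManager web UI: find the failed application corresponding to the distcp run (application_1546553376389_1538 for the job above) and drill into its logs. Note that DistCp runs as a map-only job, so there is no reduce task to look for; check any failed map task attempts and the ApplicationMaster log instead. Since the failure shows up after the maps reach 100%, the ApplicationMaster log, which covers the job commit step, is a likely place to find the underlying exception.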
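If the web UI is awkward to reach, the same logs can usually be pulled from the command line. The following is a minimal sketch, assuming log aggregation is enabled on the cluster and the yarn client is on your PATH; the output file name and the grep pattern are illustrative choices, not anything DistCp requires.

```shell
# Job ID copied from the "Job job_... failed" line in the client output.
JOB_ID="job_1546553376389_1538"

# A MapReduce job ID and its YARN application ID share the same numeric suffix.
APP_ID="application_${JOB_ID#job_}"
echo "$APP_ID"

# Dump the aggregated container logs (ApplicationMaster and task attempts) to a local file.
# Assumption: log aggregation is enabled; otherwise the logs stay on the NodeManager hosts.
if command -v yarn >/dev/null 2>&1; then
  yarn logs -applicationId "$APP_ID" > "${APP_ID}.log"
  # Show the first errors and exceptions for a quick look (pattern is only a starting point).
  grep -n -E "ERROR|Exception" "${APP_ID}.log" | head -n 50
fi
```

The aggregated output is split per container, so the ApplicationMaster section can be read separately from the map task attempts.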