Created 10-19-2016 07:34 PM
We are trying to run distcp between two clusters and are getting I/O errors. When an attempt to copy a file fails, the AM re-runs the task, and it sometimes succeeds on the 2nd or 3rd attempt, but the copy fails completely after 4 attempts because of the default AM setting.
What is strange is that there is no particular pattern to the failures: a distcp of the same files succeeds if retried after some time.
Any pointers would be helpful.
Using HDP-2.4.2 (NameNode HA)
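For reference, the per-map attempt limit that causes the hard failure after 4 tries can be raised on the distcp invocation itself; mapreduce.map.maxattempts is the standard MapReduce setting (default 4). A sketch, with the paths taken from the log below and the value 8 purely illustrative:

# Raise the per-map attempt cap so transient read timeouts have more
# chances to recover before the whole copy is failed. The value 8 is
# an example only; tune it for your environment.
hadoop distcp \
  -D mapreduce.map.maxattempts=8 \
  hftp://a.b.c.d:50070/user/hive/warehouse/attribution_impsclicks_daily \
  hdfs://kandula/user/hive/warehouse/attribution_impsclicks_daily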
2016-10-19 17:31:51,011 FATAL [IPC Server handler 14 on 41818] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1476385594016_120501_m_000012_0 - exited : java.io.IOException: File copy failed: hftp://a.b.c.d:50070/user/hive/warehouse/attribution_impsclicks_daily/action=click/wrt_dt=2016-09-25/000089_0 --> hdfs://kandula/user/hive/warehouse/attribution_impsclicks_daily/action=click/wrt_dt=2016-09-25/000089_0
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:285)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:253)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hftp://a.b.c.d:50070/user/hive/warehouse/attribution_impsclicks_daily/action=click/wrt_dt=2016-09-25/000089_0 to hdfs://kandula/user/hive/warehouse/attribution_impsclicks_daily/action=click/wrt_dt=2016-09-25/000089_0
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:281)
	... 10 more
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.net.SocketTimeoutException: Read timed out
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.getInputStream(RetriableFileCopyCommand.java:302)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:247)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:183)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:123)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
	... 11 more
Created 03-09-2017 09:32 AM
Hi @aawasthi, I know it has been a while since you asked this question, but I ran into a similar issue, and it can be caused by many things. In your case I would first check whether there is a firewall between the two clusters' datanodes (you can test with telnet). If there isn't, check the number of *_wait connections on the source datanodes. I found that some of the replicas of the files I was trying to copy were placed on a datanode that was technically working but had a lot of connections stuck in a CLOSE_WAIT state, which were exhausting the node's overall connection limits. Feel free to take a look at this answer: https://community.hortonworks.com/questions/38822/hdfs-exception.html#answer-38817 and the one below it, if you need more details. I hope it helps, camypaj
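To make those checks concrete, here is a minimal sketch; the ports are the HDP defaults (50010 for DataNode data transfer, 50070 for the NameNode HTTP/hftp endpoint) and the hostnames are placeholders for your own nodes:

# Check for a firewall between the clusters by probing the ports
# distcp actually uses (hostnames and ports are assumptions; adjust
# to your topology).
telnet source-datanode-host 50010
telnet a.b.c.d 50070

# On a source datanode, count connections stuck in CLOSE_WAIT; a
# large number here can exhaust the node's connection limits even
# though the datanode still reports as healthy.
netstat -tan | grep -c CLOSE_WAIT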