distcp: java.net.SocketTimeoutException: Read timed out

Explorer

We are trying to distcp between two clusters and are getting I/O errors. When a file copy attempt fails, the AM re-runs the task; it sometimes succeeds on the 2nd or 3rd attempt, but the copy fails completely after 4 attempts because of the AM's default maximum task attempts setting.

What is strange is that there is no particular pattern to the failures. Distcp on the same files succeeds if retried after some time.

Any pointers would be helpful.

Using HDP 2.4.2 (NameNode HA)
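For reference, a sketch of how those defaults could be overridden on the distcp job. The values below are only illustrative assumptions, and I am not certain dfs.client.socket-timeout affects the hftp (HTTP) read path at all:

# Sketch only: raise the per-task attempt limit (MapReduce default is 4) and the
# HDFS client read timeout; values are illustrative, not recommendations.
hadoop distcp \
  -D mapreduce.map.maxattempts=8 \
  -D dfs.client.socket-timeout=120000 \
  -i \
  hftp://a.b.c.d:50070/user/hive/warehouse/attribution_impsclicks_daily \
  hdfs://kandula/user/hive/warehouse/attribution_impsclicks_daily

(-i tells distcp to ignore individual copy failures instead of failing the whole job.)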

2016-10-19 17:31:51,011 FATAL [IPC Server handler 14 on 41818] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1476385594016_120501_m_000012_0 - exited : java.io.IOException: File copy failed: hftp://a.b.c.d:50070/user/hive/warehouse/attribution_impsclicks_daily/action=click/wrt_dt=2016-09-25/000089_0 --> hdfs://kandula/user/hive/warehouse/attribution_impsclicks_daily/action=click/wrt_dt=2016-09-25/000089_0
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:285)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:253)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hftp://a.b.c.d:50070/user/hive/warehouse/attribution_impsclicks_daily/action=click/wrt_dt=2016-09-25/000089_0 to hdfs://kandula/user/hive/warehouse/attribution_impsclicks_daily/action=click/wrt_dt=2016-09-25/000089_0
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:281)
	... 10 more
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.net.SocketTimeoutException: Read timed out
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.getInputStream(RetriableFileCopyCommand.java:302)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:247)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:183)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:123)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
	... 11 more
1 ACCEPTED SOLUTION

Rising Star

Hi @aawasthi, I know it has been a while since you asked this question, but I ran into a similar issue, and it can be caused by many things. What I would do in your case is check whether there is a firewall between the datanodes of the two clusters (you can test connectivity with telnet), and if there isn't, check the number of *_wait connections on the source datanodes. I found that some of the replicas of the files I was trying to copy were placed on a datanode that was technically working, but had a lot of connections stuck in CLOSE_WAIT, which were exhausting the node's overall connection limits. Take a look at the answer at https://community.hortonworks.com/questions/38822/hdfs-exception.html#answer-38817 and the one below it if you need more details. I hope it helps, camypaj
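For what it's worth, these are the kind of quick checks I mean. The hostname below is a placeholder for your environment, and 50010 is just the usual HDP 2.x datanode data-transfer port:

# From a destination-cluster node: verify the source datanode's data port is reachable.
telnet source-datanode-host 50010

# On a source datanode: summarise TCP connection states; a very large CLOSE_WAIT
# count can exhaust the node's connection/file-descriptor limits.
netstat -an | awk '{print $6}' | sort | uniq -c | sort -rn
netstat -an | grep -c CLOSE_WAIT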
