I'm planning to migare from CDH4 to CDH5 and i'm using DistCp to copy the historical data between the 2 cluster, my problem that each file in CDH4 HDFS exceeds 150 GB and the nodes with 1G network card, the DistCp failed with such error:
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 77828481024 < filelength = 119488762721
... 11 more
Almost sure the issue is the network Card, but replacing the network card for 120 Nodes isn't easy task.
As i know the distcp copy files per mapper, is there away to copy a block per mapper? is there a way to split the files and re merge then after the copy ( want to preserver the file name after the merge also).
such issue is limiting my migration to CDH5, hope you can help.
Please correct me if I am wrong we recently tested this in our environment, during copy I observed data will be copied to temporary directory and latter will be merged by default to the destination path.