
Handling DistCp of large files between two clusters

Explorer

Hi all,

I'm planning to migrate from CDH4 to CDH5, and I'm using DistCp to copy the historical data between the two clusters. My problem is that each file in the CDH4 HDFS exceeds 150 GB and the nodes only have 1G network cards. DistCp fails with the following error:

Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 77828481024 < filelength = 119488762721
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:289)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:257)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:184)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:124)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
... 11 more

 

I'm almost sure the issue is the network card, but replacing the network cards on 120 nodes isn't an easy task.

As far as I know, DistCp copies one file per mapper. Is there a way to copy one block per mapper? Or is there a way to split the files and re-merge them after the copy (I'd also like to preserve the file names after the merge)?

This issue is blocking my migration to CDH5; I hope you can help.
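
For reference, the copy is started with a command along these lines (the hostnames, ports, and paths below are placeholders for the real ones):

hadoop distcp hftp://cdh4-nn.example.com:50070/data/historical hdfs://cdh5-nn.example.com:8020/data/historical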

9 REPLIES

Explorer

Hi,

Can anyone help out?

This step is blocking my migration from CDH4.

Mentor
Are you using hftp:// or webhdfs://? I'd recommend trying the latter.

For this specific exception in REST-based copies, it's usually not a fault of the network but a buggy state in the older Jetty used on the source cluster. Typically, a rolling restart of the DataNodes resolves such a bad Jetty state, where it hangs up on a client midway through a response, causing the sudden EOF in the DistCp copy client when it was expecting the rest of the data.
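
As a minimal sketch of the change (hostnames, ports, and paths are placeholders, and this assumes WebHDFS is enabled on the source NameNode via dfs.webhdfs.enabled), only the source URI scheme in the DistCp command needs to change:

hadoop distcp webhdfs://cdh4-nn.example.com:50070/data/historical hdfs://cdh5-nn.example.com:8020/data/historical

Both hftp:// and webhdfs:// go through the NameNode's HTTP port, so nothing else in the command needs to change.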

New Contributor
Using webhdfs did the trick.

Explorer
I have used the command below to copy 36 TB to Azure blob storage using a snapshot:
HADOOP_CLIENT_OPTS="-Xmx40G" hadoop distcp -update -delete $SNAPSHOT_PATH wasbs://buclusterbackup@blobplatformdataxe265ecb.blob.core.windows.net/sep_backup/application_data
I'm getting Azure exception errors and a Java IO error.
I re-ran with -skipcrccheck and still get the same error.

Mentor
Block-level copies (with file merges) are not supported as a DistCp feature yet.

However, you can use the -update option to do progressive copies, resuming from the last failure.
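
For example (again with placeholder URIs and paths), simply re-running the same job with -update skips files that already match at the destination and only re-copies the ones that failed or changed:

hadoop distcp -update webhdfs://cdh4-nn.example.com:50070/data/historical hdfs://cdh5-nn.example.com:8020/data/historical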


Hi Harsh,

Please correct me if I am wrong: we recently tested this in our environment, and during the copy I observed that data is copied to a temporary directory and later merged into the destination path by default.

Thanks,
Kishore

Mentor
The copy is done to a temporary file, which is then moved to the actual destination upon completion. There's no "merge", only a move. This procedure ensures that partial file copies aren't left over if the job fails or gets killed.


Got it.

Explorer
Thanks, Harsh.

Indeed, I used -update before, but it didn't solve the issue, and since I have 120 cluster nodes, restarting the DataNodes wasn't a feasible solution.

What solved the issue was using webhdfs instead of hftp.