
Handling Distcp of large files between 2 clusters

Solved


Explorer

Hi all,

 

I'm planning to migrate from CDH4 to CDH5, and I'm using DistCp to copy the historical data between the two clusters. My problem is that each file in the CDH4 HDFS exceeds 150 GB and the nodes only have 1G network cards; DistCp fails with this error:

 

Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 77828481024 < filelength = 119488762721
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:289)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:257)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:184)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:124)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
... 11 more

 

I'm almost sure the issue is the network card, but replacing the network cards on 120 nodes isn't an easy task.

 

As far as I know, DistCp copies one file per mapper. Is there a way to copy one block per mapper? Or is there a way to split the files and re-merge them after the copy (I also want to preserve the file names after the merge)?

 

This issue is blocking my migration to CDH5; I hope you can help.

9 REPLIES

Re: Handling Distcp of large files between 2 clusters

Explorer

Hi,

 

Can anyone help out?

 

This step is blocking my migration from CDH4.

Re: Handling Distcp of large files between 2 clusters

Master Guru
Are you using hftp:// or webhdfs://? I'd recommend trying with the latter.

For this specific exception in REST-based copies, it's usually not a fault of the network but a buggy state in the older Jetty used on the source cluster. Typically a rolling restart of the DataNodes will resolve such a bad Jetty state, in which it hangs up on a client midway through a response, causing the sudden EOF in the DistCp client while it was still expecting the rest of the data.
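For reference, a WebHDFS-based copy might look like the sketch below. The hostnames and paths are placeholders, not values from this thread; 50070 is the default NameNode HTTP port and 8020 the default RPC port, so adjust both to your clusters.

```shell
# Minimal sketch of a cross-version copy over WebHDFS instead of hftp.
# Hostnames, ports, and paths are placeholders for illustration only.
hadoop distcp \
  webhdfs://cdh4-nn.example.com:50070/data/historical \
  hdfs://cdh5-nn.example.com:8020/data/historical
```

For a cross-version copy like CDH4 to CDH5, the command is typically run from the destination (CDH5) cluster, so the newer client reads the source over HTTP and writes natively to hdfs://.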

Re: Handling Distcp of large files between 2 clusters

New Contributor
The webhdfs did the trick.

Re: Handling Distcp of large files between 2 clusters

Explorer
I used the command below to copy 36 TB to blob storage using a snapshot:

HADOOP_CLIENT_OPTS="-Xmx40G" hadoop distcp -update -delete $SNAPSHOT_PATH wasbs://buclusterbackup@blobplatformdataxe265ecb.blob.core.windows.net/sep_backup/application_data

I'm getting Azure exception errors and a Java IO error. I re-ran with -skipcrccheck, but still get the same error.

Re: Handling Distcp of large files between 2 clusters

Master Guru
Block-level copies (with file merges) are not supported as a DistCp feature yet.

However, you can use the -update option to do progressive copies, resuming from the point of the last failure.
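Resuming after a failure can then be as simple as re-running the same command with -update, which skips files that already match at the destination and copies only the remainder (hostnames and paths below are placeholders):

```shell
# Re-running the same copy with -update resumes it: files already present
# and matching at the destination are skipped, so only the rest is copied.
# Hostnames and paths are placeholders for illustration only.
hadoop distcp -update \
  webhdfs://source-nn.example.com:50070/data/historical \
  hdfs://dest-nn.example.com:8020/data/historical
```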

Re: Handling Distcp of large files between 2 clusters

Hi Harsh, 

 

Please correct me if I am wrong; we recently tested this in our environment, and during the copy I observed that data is copied to a temporary directory and later merged by default into the destination path.

 

Thanks

Kishore

Re: Handling Distcp of large files between 2 clusters

Master Guru
Copy is done to a temporary file, which is then moved to the actual destination upon completion. There's no "merge", only a move. This procedure ensures that partial file copies aren't left behind if the job fails or gets killed.
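The copy-to-temporary-then-move behavior described above can be illustrated with a small shell sketch. The file names here are made up for the example; they are not what DistCp uses internally.

```shell
# Sketch of copy-to-temp-then-rename: the destination name only appears
# once the copy has fully succeeded, so a mid-copy failure leaves no
# partial destination file behind. File names are illustrative only.
set -e
echo "payload" > source.dat          # stand-in for the source file
tmp="$(mktemp .copy_tmp_XXXXXX)"     # temporary file in the target directory
cp source.dat "$tmp"                 # a failure here affects only the temp file
mv "$tmp" dest.dat                   # single rename on completion: no merge
```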

Re: Handling Distcp of large files between 2 clusters

Got it.

Re: Handling Distcp of large files between 2 clusters

Explorer
Thanks Harsh.

Indeed, I used -update before but it didn't solve the issue, and since I
have 120 cluster nodes, restarting the DataNodes wasn't a feasible solution.

What solved the issue was using webhdfs:// instead of hftp://.