Explorer
Posts: 22
Registered: ‎04-12-2016
Accepted Solution

Handling Distcp of large files between 2 clusters

Hi all,

 

I'm planning to migrate from CDH4 to CDH5, and I'm using DistCp to copy the historical data between the two clusters. My problem is that each file in the CDH4 HDFS exceeds 150 GB, and the nodes have 1G network cards. DistCp fails with this error:

 

Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 77828481024 < filelength = 119488762721
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:289)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:257)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:184)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:124)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
... 11 more

 

I'm almost sure the issue is the network card, but replacing the network cards on 120 nodes isn't an easy task.

 

As I understand it, DistCp copies one file per mapper. Is there a way to copy a block per mapper? Or is there a way to split the files and re-merge them after the copy (I also want to preserve the original file names after the merge)?

 

This issue is blocking my migration to CDH5; I hope you can help.

Explorer
Posts: 22
Registered: ‎04-12-2016

Re: Handling Distcp of large files between 2 clusters

Hi,

 

Can anyone help out?

 

This step is blocking my migration from CDH4.

Posts: 1,754
Kudos: 371
Solutions: 279
Registered: ‎07-31-2013

Re: Handling Distcp of large files between 2 clusters

Are you using hftp:// or webhdfs://? I'd recommend trying with the latter.

For this specific exception in REST-based copies, it's usually not a fault with the network but a buggy state in the older Jetty used on the source cluster. Typically, a rolling restart of the DataNodes will resolve such a bad Jetty state, where it hangs up on a client mid-way through a response, causing the sudden EOF in the DistCp client when it was expecting the rest of the data.
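For reference, a cross-version DistCp invocation over webhdfs:// might look like the sketch below. The hostnames, ports, and paths are hypothetical placeholders, not taken from this thread; the defaults here assume a standard NameNode HTTP port (50070) on the CDH4 source and RPC port (8020) on the CDH5 destination.

```shell
# Run from the destination (CDH5) cluster so its newer MapReduce does the copy.
# webhdfs:// is wire-compatible across major HDFS versions, unlike raw hdfs:// RPC.
# cdh4-nn, cdh5-nn, and /data/historical are placeholder names for illustration.
hadoop distcp \
  webhdfs://cdh4-nn:50070/data/historical \
  hdfs://cdh5-nn:8020/data/historical
```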
Posts: 1,754
Kudos: 371
Solutions: 279
Registered: ‎07-31-2013

Re: Handling Distcp of large files between 2 clusters

Block-level copies (with file merges) are not yet supported as a DistCp feature.

However, you can use the -update option to do progressive copies, resuming from the last failure.
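A sketch of such a resumable re-run (same hypothetical hostnames and paths as above): re-invoking DistCp with -update skips files that already exist at the destination with a matching size/checksum, so only the files that failed previously are copied again.

```shell
# Re-run after a failed job; -update makes DistCp skip files already
# present and matching at the destination, effectively resuming the copy.
# Hostnames and paths are placeholders for illustration.
hadoop distcp -update \
  webhdfs://cdh4-nn:50070/data/historical \
  hdfs://cdh5-nn:8020/data/historical
```

Note that -update can only resume at whole-file granularity; a partially copied 150 GB file still restarts from the beginning.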
Explorer
Posts: 38
Registered: ‎09-29-2015

Re: Handling Distcp of large files between 2 clusters

Hi Harsh, 

 

Please correct me if I am wrong. We recently tested this in our environment, and during the copy I observed that data is copied to a temporary directory and later merged into the destination path by default.

 

Thanks

Kishore

Posts: 1,754
Kudos: 371
Solutions: 279
Registered: ‎07-31-2013

Re: Handling Distcp of large files between 2 clusters

Copy is done to a temporary file, and then moved to the actual destination
upon completion. There's no "merge", only move. This procedure is done to
ensure partial file copies don't get leftover if the job fails or gets
killed.
Explorer
Posts: 38
Registered: ‎09-29-2015

Re: Handling Distcp of large files between 2 clusters

Got it.

Explorer
Posts: 22
Registered: ‎04-12-2016

Re: Handling Distcp of large files between 2 clusters

Thanks Harsh.

Indeed, I had used -update before, but it didn't solve the issue, and since I have 120 cluster nodes, restarting the DataNodes wasn't a feasible solution.

What solved the issue was using webhdfs instead of hftp.
Tee
New Contributor
Posts: 1
Registered: ‎07-30-2018

Re: Handling Distcp of large files between 2 clusters

The webhdfs did the trick.