<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Handling Distcp of large files between 2 clusters - Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42470#M55015</link>
    <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm planning to migrate from CDH4 to CDH5, and I'm using DistCp to copy the historical data between the two clusters. The problem is that each file in the CDH4 HDFS exceeds 150 GB and the nodes have 1G network cards, so DistCp fails with the following error:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 77828481024 &amp;lt; filelength = 119488762721&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:289)&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:257)&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:184)&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:124)&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)&lt;BR /&gt;at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)&lt;BR /&gt;... 11 more&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm almost sure the issue is the network card, but replacing the network card on 120 nodes isn't an easy task.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As far as I know, DistCp copies one file per mapper. Is there a way to copy one block per mapper? Is there a way to split the files and re-merge them after the copy (I also want to preserve the file names after the merge)?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This issue is blocking my migration to CDH5; I hope you can help.&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 10:28:12 GMT</pubDate>
    <dc:creator>Fawzea</dc:creator>
    <dc:date>2022-09-16T10:28:12Z</dc:date>
    <item>
      <title>Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42470#M55015</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm planning to migrate from CDH4 to CDH5, and I'm using DistCp to copy the historical data between the two clusters. The problem is that each file in the CDH4 HDFS exceeds 150 GB and the nodes have 1G network cards, so DistCp fails with the following error:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 77828481024 &amp;lt; filelength = 119488762721&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:289)&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:257)&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:184)&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:124)&lt;BR /&gt;at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)&lt;BR /&gt;at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)&lt;BR /&gt;... 11 more&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm almost sure the issue is the network card, but replacing the network card on 120 nodes isn't an easy task.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As far as I know, DistCp copies one file per mapper. Is there a way to copy one block per mapper? Is there a way to split the files and re-merge them after the copy (I also want to preserve the file names after the merge)?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This issue is blocking my migration to CDH5; I hope you can help.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:28:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42470#M55015</guid>
      <dc:creator>Fawzea</dc:creator>
      <dc:date>2022-09-16T10:28:12Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42614#M55016</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone help out?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This step is blocking my migration from CDH4.&lt;/P&gt;</description>
      <pubDate>Tue, 05 Jul 2016 16:03:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42614#M55016</guid>
      <dc:creator>Fawzea</dc:creator>
      <dc:date>2016-07-05T16:03:52Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42662#M55017</link>
      <description>Are you using hftp:// or webhdfs://? I'd recommend trying the latter.&lt;BR /&gt;&lt;BR /&gt;For this specific exception in REST-based copies, it's usually not a fault with the network but a buggy state in the older Jetty used on the source cluster. Typically a rolling restart of the DataNodes will resolve such a bad state of Jetty, where it hangs up on a client mid-way through a response, causing the sudden EOF on the copying DistCp client when it was expecting the rest of the data.</description>
      <pubDate>Thu, 07 Jul 2016 06:58:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42662#M55017</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2016-07-07T06:58:06Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42663#M55018</link>
      <description>Block-level copies (with file merges) are not supported as a DistCp feature yet.&lt;BR /&gt;&lt;BR /&gt;However, you can use the -update option to do progressive copies, resuming from the last failure.</description>
      <pubDate>Thu, 07 Jul 2016 06:59:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42663#M55018</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2016-07-07T06:59:49Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42687#M55019</link>
      <description>&lt;P&gt;Hi Harsh,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please correct me if I am wrong. We recently tested this in our environment, and during the copy I observed that the data is copied to a temporary directory and later merged by default into the destination path.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Kishore&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2016 06:25:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42687#M55019</guid>
      <dc:creator>TheKishore432</dc:creator>
      <dc:date>2016-07-08T06:25:37Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42688#M55020</link>
      <description>The copy is done to a temporary file, which is then moved to the actual destination upon completion. There's no "merge", only a move. This procedure ensures partial file copies aren't left over if the job fails or gets killed.</description>
      <pubDate>Fri, 08 Jul 2016 06:42:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42688#M55020</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2016-07-08T06:42:08Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42690#M55021</link>
      <description>&lt;P&gt;Got it.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2016 06:55:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42690#M55021</guid>
      <dc:creator>TheKishore432</dc:creator>
      <dc:date>2016-07-08T06:55:44Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42729#M55022</link>
      <description>Thanks Harsh.&lt;BR /&gt;&lt;BR /&gt;Indeed, I used -update before, but it didn't solve the issue, and since I have 120 cluster nodes, restarting the DataNodes wasn't a feasible solution.&lt;BR /&gt;&lt;BR /&gt;What solved the issue was using webhdfs:// instead of hftp://.</description>
      <pubDate>Sat, 09 Jul 2016 15:57:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/42729#M55022</guid>
      <dc:creator>Fawzea</dc:creator>
      <dc:date>2016-07-09T15:57:08Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/77726#M55023</link>
      <description>Using webhdfs:// did the trick.</description>
      <pubDate>Mon, 30 Jul 2018 18:39:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/77726#M55023</guid>
      <dc:creator>Tee</dc:creator>
      <dc:date>2018-07-30T18:39:56Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Distcp of large files between 2 clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/84553#M55024</link>
      <description>I used the command below to copy 36 TB to blob storage using a snapshot:&lt;BR /&gt;HADOOP_CLIENT_OPTS="-Xmx40G" hadoop distcp -update -delete $SNAPSHOT_PATH wasbs://buclusterbackup@blobplatformdataxe265ecb.blob.core.windows.net/sep_backup/application_data&lt;BR /&gt;I'm getting Azure exceptions and a Java IO error.&lt;BR /&gt;I re-ran with -skipcrccheck, but I still get the same error.</description>
      <pubDate>Fri, 04 Jan 2019 12:19:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Handling-Distcp-of-large-files-between-2-clusters/m-p/84553#M55024</guid>
      <dc:creator>pra_big</dc:creator>
      <dc:date>2019-01-04T12:19:38Z</dc:date>
    </item>
  </channel>
</rss>

