Handling DistCp of large files between 2 clusters
- Labels: Apache Hadoop, HDFS
Created on 06-29-2016 08:19 PM - edited 09-16-2022 03:28 AM
Hi all,
I'm planning to migrate from CDH4 to CDH5, and I'm using DistCp to copy the historical data between the two clusters. My problem is that each file in the CDH4 HDFS exceeds 150 GB and the nodes have 1G network cards; DistCp fails with this error:
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.IOException: Got EOF but currentPos = 77828481024 < filelength = 119488762721
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:289)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:257)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:184)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:124)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
... 11 more
I'm almost sure the issue is the network card, but replacing the network cards on 120 nodes isn't an easy task.
As far as I know, DistCp copies one file per mapper. Is there a way to copy one block per mapper? Or is there a way to split the files and re-merge them after the copy (I also want to preserve the file names after the merge)?
This issue is blocking my migration to CDH5; I hope you can help.
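For reference, I'm running the copy from the destination CDH5 cluster over the read-only HFTP interface, with a command along these lines (hostnames and paths here are placeholders):
# HFTP is read-only, so it can only appear on the source side; 50070 is the default NameNode HTTP port it serves from
hadoop distcp hftp://cdh4-nn.example.com:50070/data/historical hdfs://cdh5-nn.example.com:8020/data/historical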
Created 07-05-2016 09:03 AM
Hi,
Can anyone help out?
This step is blocking my migration from CDH4.
Created 07-06-2016 11:58 PM
For this specific exception in REST-based copies, it's usually not a fault with the network but a buggy state in the older Jetty used on the source cluster. Typically a rolling restart of the DataNodes will resolve such a bad Jetty state, where Jetty hangs up on a client midway through a response, causing the sudden EOF in the copying DistCp client while it was still expecting the rest of the data.
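If the cluster isn't managed by Cloudera Manager, the rolling restart can be done one DataNode at a time with the packaged init script (a sketch, assuming a package-based CDH install; wait for each node to rejoin before moving to the next):
# Run on each DataNode in turn
sudo service hadoop-hdfs-datanode restart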
Created 01-04-2019 04:19 AM
HADOOP_CLIENT_OPTS="-Xmx40G" hadoop distcp -update -delete $SNAPSHOT_PATH wasbs://buclusterbackup@blobplatformdataxe265ecb.blob.core.windows.net/sep_backup/application_data
I'm getting Azure exception errors and a Java IO error.
I re-ran with -skipcrccheck and still get the same error.
Created 07-06-2016 11:59 PM
However, you can use the -update option to do progressive copies, resuming from where the last run failed.
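With -update, files that already exist at the destination with a matching size (and checksum, where the source filesystem exposes one) are skipped, so rerunning the same command after a failure only copies what is missing or incomplete. A sketch, with the same placeholder hosts and paths as above:
# Safe to rerun after a failure; already-copied files are skipped
hadoop distcp -update hftp://cdh4-nn.example.com:50070/data/historical hdfs://cdh5-nn.example.com:8020/data/historical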
Created 07-07-2016 11:25 PM
Hi Harsh,
Please correct me if I am wrong. We recently tested this in our environment, and during the copy I observed that data is copied to a temporary directory and later merged into the destination path by default.
Thanks,
Kishore
Created 07-07-2016 11:42 PM
Each file is copied to a temporary location and then moved into place upon completion. There's no "merge", only a move. This procedure is done to ensure partial file copies don't get left over if the job fails or gets killed.
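To make the move concrete: the final step is just an HDFS rename, which is atomic, so a killed job leaves at most a hidden temporary file rather than a truncated destination file. A sketch of that step (the temp filename below is hypothetical; DistCp's actual temp naming is an internal detail):
# Atomic rename within the target directory; readers never see a partial file
hdfs dfs -mv /dst/data/.distcp.tmp.example /dst/data/bigfile.dat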
Created 07-07-2016 11:55 PM
Got it.
Created 07-09-2016 08:57 AM
Indeed, I had used -update before but it didn't solve the issue, and since I have 120 cluster nodes, restarting the DataNodes wasn't a feasible solution.
What solved the issue was using webhdfs instead of hftp.
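For anyone hitting the same problem, the WebHDFS-based equivalent of the copy looks like this (placeholder hosts and paths; WebHDFS is served from the NameNode HTTP port, 50070 by default on these versions):
# Unlike the read-only HFTP interface, webhdfs:// supports both reads and writes
hadoop distcp -update webhdfs://cdh4-nn.example.com:50070/data/historical hdfs://cdh5-nn.example.com:8020/data/historical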
