Using DistCp Between Two Clusters


I want to copy data between two clusters. Are there any suggestions or methods for copying the data faster than the usual speed?


You can use distcp over a distributed protocol such as hdfs or webhdfs. Make sure to use enough mappers to fully saturate the drives and DataNodes to get maximum speed.
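
As a rough sketch of what that can look like: distcp's -m option sets the maximum number of simultaneous copies (map tasks). The host names, port, paths, and the map count of 50 below are placeholders I made up for illustration, not values from this thread, so adjust them for your clusters.

```shell
# Hypothetical endpoints: replace with your own NameNode addresses and paths.
SRC="hdfs://nn1.example.com:8020/data/source"
DST="hdfs://nn2.example.com:8020/data/target"

# Assumed starting point; raise or lower it based on cluster size and load.
MAPS=50

# Build the command as a string so it can be reviewed before running.
# (Assumes the paths contain no whitespace.)
CMD="hadoop distcp -m $MAPS $SRC $DST"
echo "$CMD"

# Only execute when RUN=1 is set, e.g. RUN=1 sh copy.sh
if [ "${RUN:-0}" = "1" ]; then
  $CMD
fi
```

If the two clusters run different Hadoop versions, a webhdfs:// URI on the remote side is the usual alternative to hdfs://.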


Would any network changes affect this speed, for example a faster network interface?

I'd add one more thing: if the two clusters are not adjacent network-wise, but instead use some network you have to share with other people, use -bandwidth to limit the actual bandwidth of distcp.

Why? Between two large clusters, the IO bandwidth off disk will actually be higher than the network backbone bandwidth, so distcp will saturate the backbone. This will make your network ops team very unhappy. And yes, it has happened at scale.
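
A hedged example of throttling: -bandwidth is specified per map in MB/second, so the approximate aggregate ceiling is the number of maps times that value. The endpoints and the numbers here (20 maps, 10 MB/s each) are illustrative assumptions only; pick values your network team is comfortable with.

```shell
# Hypothetical endpoints: replace with your own NameNode addresses and paths.
SRC="hdfs://nn1.example.com:8020/data/source"
DST="hdfs://nn2.example.com:8020/data/target"

MAPS=20          # assumed number of simultaneous copies
MB_PER_MAP=10    # assumed per-map limit in MB/s, passed to -bandwidth

# Approximate upper bound on what the whole job can push across the link.
TOTAL_MB=$((MAPS * MB_PER_MAP))
echo "Approximate aggregate cap: ${TOTAL_MB} MB/s"

# Build the throttled command (assumes the paths contain no whitespace).
CMD="hadoop distcp -m $MAPS -bandwidth $MB_PER_MAP $SRC $DST"
echo "$CMD"

# Only execute when RUN=1 is set.
if [ "${RUN:-0}" = "1" ]; then
  $CMD
fi
```

With these numbers the job is capped at roughly 200 MB/s in total, so lower either value if the shared link cannot spare that much.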

You might also check this discussion to see if any info is useful: