Using DistCp Between Two Clusters


I want to copy data between two clusters. Are there any suggestions or methods for copying the data faster than the usual speed?

4 REPLIES

Re: Using DistCp Between Two Clusters

You can run DistCp over a distributed protocol such as hdfs:// or webhdfs://. Make sure to use enough mappers (the -m option) to fully saturate the drives and DataNodes and get maximum speed.
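As a rough illustration of the advice above, here is a minimal Python sketch that assembles a DistCp command line with an explicit mapper count. The NameNode hosts, port, paths, and the value 40 are hypothetical placeholders, and the helper name `build_distcp_command` is made up for this example; only `hadoop distcp` and its `-m` option come from DistCp itself.

```python
import shlex


def build_distcp_command(src, dst, num_maps):
    """Return the argv list for a DistCp run with an explicit mapper count."""
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    # -m sets the maximum number of simultaneous copies (map tasks).
    return ["hadoop", "distcp", "-m", str(num_maps), src, dst]


# Hypothetical NameNode addresses and paths; replace with your own.
cmd = build_distcp_command(
    "hdfs://nn1.example.com:8020/data/events",
    "hdfs://nn2.example.com:8020/data/events",
    num_maps=40,
)

# Print the command so it can be reviewed before running it on an edge node.
print(shlex.join(cmd))
```

The right mapper count depends on the number of files, DataNodes, and disks, so treat 40 as a starting point to tune rather than a recommendation.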


Re: Using DistCp Between Two Clusters


Do any network changes affect this speed, such as faster network interfaces?


Re: Using DistCp Between Two Clusters

I'd add one more thing: if the two clusters are not adjacent network-wise, but instead are connected over a network you have to share with other people, use -bandwidth (a per-map limit, in MB/s) to cap the bandwidth DistCp actually uses.

Why? Between two large clusters, the aggregate I/O bandwidth off disk can actually be higher than the network backbone bandwidth, so an unthrottled DistCp will saturate the backbone. This will make your network ops team very unhappy. And yes, it has happened at scale.
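To make the throttling arithmetic concrete, here is a small Python sketch. Because -bandwidth applies to each map, the aggregate rate is roughly the number of maps times the per-map value. The 400 MB/s budget, the 20 maps, the hosts, and the helper name `per_map_bandwidth_mb` are all assumptions chosen for illustration, not values from this thread.

```python
def per_map_bandwidth_mb(total_budget_mb_s, num_maps):
    """Split a total network budget (MB/s) evenly across DistCp map tasks."""
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    if total_budget_mb_s < num_maps:
        # Each map needs at least 1 MB/s; use fewer maps to stay in budget.
        raise ValueError("budget too small for this many maps")
    # Integer division rounds down, so the aggregate stays within the budget.
    return total_budget_mb_s // num_maps


# Assumed figures: 400 MB/s agreed with the network team, 20 map tasks.
budget_mb_s = 400
num_maps = 20
per_map = per_map_bandwidth_mb(budget_mb_s, num_maps)

# Hypothetical NameNode addresses and paths; replace with your own.
cmd = [
    "hadoop", "distcp",
    "-m", str(num_maps),
    "-bandwidth", str(per_map),
    "hdfs://nn1.example.com:8020/data/events",
    "hdfs://nn2.example.com:8020/data/events",
]
print(" ".join(cmd))
print("aggregate cap (MB/s):", per_map * num_maps)
```

With these assumed numbers each map is limited to 20 MB/s, for an aggregate of about 400 MB/s across the job.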


Re: Using DistCp Between Two Clusters


You might also check this discussion to see whether any of the information there is useful: https://community.hortonworks.com/questions/31997/distcp-performance-issue.html
