
Do I have to worry about performance impact on source remote cluster when running distcp?


Explorer

We're migrating some data from a production cluster to a development cluster by running distcp from the development cluster. We plan to set up a script on the dev cluster that runs distcp throughout the day to copy a certain archive directory from the production cluster. The production cluster runs operational jobs throughout the day, and we don't want to interrupt them. Is it safe to run distcp from the dev cluster to copy data from the production cluster while it is still running operational jobs?

One condition: the archive directory we're trying to copy is not accessed by any operational job. It's a passive directory used exclusively for storage.

Thanks.

2 REPLIES

Re: Do I have to worry about performance impact on source remote cluster when running distcp?

Super Collaborator

If you don't throttle the DistCp jobs, then yes, they could affect performance on the source cluster.

Fortunately, you can throttle the DistCp command by specifying the maximum number of concurrent maps (-m) and the bandwidth available to each map (-bandwidth).
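As a rough illustration of the throttling described above, here is a minimal sketch of a throttled DistCp invocation. The NameNode addresses, paths, and limit values are placeholders I made up, not values from this thread. The script only prints the command so it can be reviewed before anything is run against production.

```shell
#!/bin/sh
# All values below are hypothetical placeholders; adjust for your clusters.
SRC="hdfs://prod-nn:8020/data/archive"   # assumed source path on production
DST="hdfs://dev-nn:8020/data/archive"    # assumed target path on development
MAPS=4          # -m: maximum number of simultaneous copy maps
BANDWIDTH=10    # -bandwidth: per-map limit in MB/second

distcp_cmd() {
  # Arguments: max maps, per-map bandwidth, source, destination.
  # Prints the command instead of running it.
  echo "hadoop distcp -m $1 -bandwidth $2 $3 $4"
}

distcp_cmd "$MAPS" "$BANDWIDTH" "$SRC" "$DST"
```

With these example values the copy is limited to at most 4 maps at roughly 10 MB/s each, so about 40 MB/s in total read from the source cluster in the worst case.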

 

Check the documentation.

Re: Do I have to worry about performance impact on source remote cluster when running distcp?

Master Guru
Documentation for DistCp (covering the two scaling options Mathieu mentioned) can be found here: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/D...

It's best to start with a small -m value and measure the impact on the source cluster's charts as you ramp it up, until you reach an acceptable outcome.
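One way to picture that ramp-up is the sketch below, which prints one throttled command per step with an increasing -m value. The paths, the bandwidth cap, and the step values 2, 4, 8 are assumptions chosen for illustration. The idea is to run one step, check the source cluster's charts, and move to the next step only if the impact is acceptable.

```shell
#!/bin/sh
# Hypothetical placeholders; not values taken from this thread.
SRC="hdfs://prod-nn:8020/data/archive"
DST="hdfs://dev-nn:8020/data/archive"
BANDWIDTH=10    # per-map limit in MB/second, held constant while -m grows

ramp_plan() {
  # Print one command per map count given as an argument, in the order given.
  for maps in "$@"; do
    echo "hadoop distcp -m $maps -bandwidth $BANDWIDTH $SRC $DST"
  done
}

ramp_plan 2 4 8
```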