Explorer
Posts: 23
Registered: ‎09-01-2014

Do I have to worry about performance impact on source remote cluster when running distcp?

We're migrating some data from our production cluster to our development cluster by running distcp from the development cluster. We plan to set up a script on the dev cluster that runs distcp throughout the day to copy a certain archive directory from the production cluster. The production cluster runs operational jobs throughout the day, and we don't want to interrupt them. So is it safe to run distcp from the dev cluster to copy data from the production cluster while it's still running operational jobs?

Note that the archive directory we're trying to copy is not accessed by any operational job. It's a passive directory used exclusively for storage.

 

Thanks.

Posts: 177
Topics: 8
Kudos: 28
Solutions: 19
Registered: ‎07-16-2015

Re: Do I have to worry about performance impact on source remote cluster when running distcp?

I'd guess that if you don't throttle the DistCp jobs, then yes, they could affect performance on the source cluster.

Luckily, you can throttle the DistCp command by capping the number of concurrent maps (-m) and the bandwidth available to each map (-bandwidth, in MB/s).

Check the documentation for details.
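
To make the throttling concrete, here is a minimal Python sketch of how a scheduled script on the dev cluster might assemble a throttled command. The flags -m and -bandwidth are standard DistCp options; the NameNode hostnames, ports, paths, and the chosen limits are hypothetical placeholders, not values from this thread.

```python
import shlex


def build_distcp_command(src, dst, max_maps, bandwidth_mb_per_map):
    """Build the argv list for a throttled DistCp run.

    max_maps             -> value for -m (maximum simultaneous copy maps)
    bandwidth_mb_per_map -> value for -bandwidth (MB/s allowed per map)
    """
    if max_maps < 1 or bandwidth_mb_per_map < 1:
        raise ValueError("max_maps and bandwidth_mb_per_map must be >= 1")
    return [
        "hadoop", "distcp",
        "-m", str(max_maps),
        "-bandwidth", str(bandwidth_mb_per_map),
        src,
        dst,
    ]


# Hypothetical source and target URIs, for illustration only.
cmd = build_distcp_command(
    "hdfs://prod-nn:8020/archive",
    "hdfs://dev-nn:8020/archive",
    max_maps=4,
    bandwidth_mb_per_map=10,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a host with the Hadoop client installed. With these example limits the job would read at most roughly 4 x 10 = 40 MB/s from the source cluster.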

Posts: 1,903
Kudos: 435
Solutions: 307
Registered: ‎07-31-2013

Re: Do I have to worry about performance impact on source remote cluster when running distcp?

Documentation for DistCp (covering the two throttling options Mathieu mentioned) can be found here: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/D...

It's best to start with a small -m value and measure the impact on the source cluster's charts as you ramp it up, until you reach an acceptable balance between copy speed and load.
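
As a rough illustration of that ramp-up, the sketch below lists candidate -m values together with the worst-case aggregate read rate each one allows when combined with a per-map -bandwidth cap. Doubling at each step and the specific numbers are assumptions for the example, not a recommendation from the documentation; the source cluster's charts should still decide when to stop.

```python
def ramp_up_steps(start_maps, max_maps, bandwidth_mb_per_map):
    """Return (maps, aggregate_mb_per_s) pairs, doubling maps each step.

    aggregate_mb_per_s is the upper bound on what the job can read from
    the source cluster: number of maps times the per-map bandwidth cap.
    """
    if start_maps < 1 or max_maps < start_maps or bandwidth_mb_per_map < 1:
        raise ValueError("need 1 <= start_maps <= max_maps and bandwidth >= 1")
    steps = []
    maps = start_maps
    while maps <= max_maps:
        steps.append((maps, maps * bandwidth_mb_per_map))
        maps *= 2  # assumed ramp policy: double after each clean run
    return steps


# Example: start at 2 maps, never exceed 16, 10 MB/s per map.
for maps, ceiling in ramp_up_steps(2, 16, 10):
    print(f"-m {maps} -bandwidth 10: up to {ceiling} MB/s read from the source")
```

Run one step, check the production cluster's metrics, and only move to the next step if the operational jobs are unaffected.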