Rack awareness during HDFS replication

New Contributor

We are building a new Cloudera cluster and replicating the HDFS data from an existing cluster. The existing cluster spans two sites, and rack awareness is configured accordingly, with a default replication factor of 3.

 

If we are building this new cluster at one of these two sites, is it possible to ensure that HDFS is replicating from the same physical location and not from the other site? The background: we don't want to cause a big network load between the two sites if all the data is already locally available.

1 ACCEPTED SOLUTION

Master Collaborator

Are you using distcp for the migration? If your requirement is to reduce heavy network load and you are OK with the migration taking longer, distcp has a -bandwidth option that can help. It sets the maximum bandwidth, in MB/s, that each map task can use, so you'd of course first need to estimate the number of map tasks that will run. Otherwise, I'm not aware of any rack-aware HDFS migration approach.
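
As a rough illustration of that estimate, here is a minimal Python sketch that splits an overall inter-site budget across the maps and assembles the distcp call. The -m (maximum number of maps) and -bandwidth (per-map limit in MB/s) flags are standard distcp options; the function name, the NameNode addresses, and the 200 MB/s budget are made-up placeholders, so adjust them to your environment.

```python
import shlex


def distcp_command(src, dst, total_mb_per_s, num_maps):
    """Build a distcp command whose combined throughput stays under a budget.

    Hypothetical helper: -bandwidth applies to each map separately, so the
    overall budget is divided by the maximum number of maps set with -m.
    """
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    # Round down to a whole number of MB/s so the total never exceeds the budget.
    per_map = total_mb_per_s // num_maps
    if per_map < 1:
        raise ValueError("budget is too small for this many maps")
    return [
        "hadoop", "distcp",
        "-m", str(num_maps),          # upper limit on simultaneous map tasks
        "-bandwidth", str(per_map),   # MB/s allowed per map task
        src, dst,
    ]


if __name__ == "__main__":
    # Placeholder cluster addresses and budget, for illustration only.
    cmd = distcp_command(
        "hdfs://old-nn:8020/data",
        "hdfs://new-nn:8020/data",
        total_mb_per_s=200,
        num_maps=20,
    )
    print(shlex.join(cmd))
```

Because -m is only an upper limit, distcp may run fewer maps than requested, in which case the actual load ends up below the budget rather than above it.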

2 REPLIES

New Contributor

I was afraid of that. Yes, I am using distcp for the migration. Thanks very much for your reply nevertheless. I had hoped to keep the bandwidth option as a last resort, but it will probably have to do.