I am in the process of improving the resilience of our Hadoop clusters.
We are using a twin-datacenter architecture: the Hadoop cluster nodes are located in two different buildings separated by 10 km, with NameNode HA enabled.
We are using a replication factor of 4 with rack awareness over 2 racks (one rack per site).
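For reference, the rack awareness is driven by a topology script along these lines (the IP ranges and rack names are illustrative, not our real ones):

```sh
#!/bin/bash
# Topology script referenced by net.topology.script.file.name in core-site.xml.
# Hadoop passes DataNode hostnames/IPs as arguments; we print one rack per argument.
for host in "$@"; do
  case "$host" in
    10.1.*) echo "/site1/rack1" ;;   # building A
    10.2.*) echo "/site2/rack1" ;;   # building B, 10 km away
    *)      echo "/default-rack" ;;  # fallback for unknown hosts
  esac
done
```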
A replication factor of 4 is probably a bit of a "luxury", but it should protect against the loss of an entire rack (i.e. the loss of a site) plus the loss of some nodes on the remaining site.
If we lose an entire rack, I am wondering whether HDFS will try to re-replicate the data on the remaining rack, so that we end up with 4 replicas on the same rack and over-consume disk space there... or will it "disable" the replicas that are supposed to be located on the failed rack?
Does it make sense to declare 4 racks (one for each replica, two per site) in order to ensure that the data is replicated across both sites in a balanced way (2x2)?
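For what it's worth, the way replicas are currently spread over the racks can be inspected per block with fsck (the path is just an example):

```sh
# Print each block's replica locations together with their rack assignments.
hdfs fsck /data/important -files -blocks -locations -racks
```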
If you are worried about filling up the disk space during a rack maintenance operation, you can configure the balancer to be really slow, so that over a 48-hour window virtually nothing will be replicated. Or, if the situation permits, you can put the NameNode into safe mode. This will allow read operations but no writes.
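For example (the bandwidth value is illustrative; it applies per DataNode, in bytes per second):

```sh
# Throttle the per-DataNode balancer bandwidth to ~1 MB/s (1048576 bytes/s, illustrative value)
hdfs dfsadmin -setBalancerBandwidth 1048576

# Put the NameNode into safe mode for the maintenance window: reads are allowed, writes are rejected
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode get     # check the current state
hdfs dfsadmin -safemode leave   # return to normal operation afterwards
```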
This is correct: "if the replication factor is equal to the number of racks, there is no guarantee that there will be a replica in each rack." The policy only guarantees that not all of the replicas end up on the same rack.
My concern is about the speed of re-replication: if, say, one rack is unavailable for 24-48 hours for maintenance reasons, HDFS might in the meantime re-replicate all the data onto the remaining rack and saturate the disk space there!
I can't find any documentation mentioning this "HDFS rebalance speed".
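If I understand correctly, re-replication only kicks in once a DataNode has been declared dead (roughly 10 minutes with the default heartbeat settings), and its rate is then governed by NameNode-side throttles such as the following rather than by the balancer; the values shown are the usual defaults and should be checked against your Hadoop version:

```xml
<!-- hdfs-site.xml: NameNode-side throttles for re-replication work
     (values are common defaults; verify against your Hadoop version) -->
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>2</value> <!-- blocks scheduled per live DataNode per replication iteration -->
</property>
<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>2</value> <!-- max concurrent replication streams per DataNode (normal priority) -->
</property>
<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>4</value> <!-- hard cap, including highest-priority re-replication -->
</property>
```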
Also, it looks to me that if the replication factor is equal to the number of racks, there is no guarantee that there will be a replica in each rack.