If the servers are physically distributed at random, in uneven numbers, across 10 racks:
* Should they be consolidated under a logical racking policy, bringing the count down to 2~3 racks of equal size?
* Is there a recommended approach to, or an advantage in, creating a logical rack mapping instead of a physical one?
* If so, how many racks and what rack sizes should be considered, given the 1/3 vs 2/3 replica-per-rack distribution that is in place?
Is the random distribution of nodes across 10 racks a given? I mean, is it maybe a result of using VMs and something out of your control? Or are the nodes deliberately distributed across 10 racks to reduce the impact of a rack failure when it occurs?
I would like to understand your motivation. Distributing nodes randomly across 10 racks doesn't sound like a very good idea. If you are trying to protect against multiple rack failures, then I think that if 3 or 4 racks are ever down at once, you have a much bigger problem than just your Hadoop cluster. On the other hand, you will be paying a price in terms of increased complexity and slower write speeds.
This is because the infrastructure team has a lot of unused servers and rack space across different racks that could be leveraged for Hadoop. All these racks are part of the same datacenter, and they do consider network speeds. However, the underlying question stands regardless of where the servers physically sit: e.g. if we have 2 racks of 15 servers each vs 3 racks of 15 servers each, knowing that with a replication factor of 3 the data blocks are distributed as 1/3 vs 2/3 of the replicas per rack. Rack sizing becomes crucial, as one rack might end up filling faster than the other if they are the same size.
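To put rough numbers on the 1/3 vs 2/3 concern, here is a minimal simulation sketch. It assumes a simplified version of the default HDFS placement policy (first replica on the writer's rack, second and third together on one other rack); the rack names, block count, and writer weights are made up for illustration and are not taken from any real cluster.

```python
import random
from collections import Counter

def place_replicas(writer_rack, racks, rng):
    """Racks holding the 3 replicas of one block (simplified default policy):
    replica 1 on the writer's rack, replicas 2 and 3 on one other rack."""
    remote = rng.choice([r for r in racks if r != writer_rack])
    return [writer_rack, remote, remote]

def simulate(racks, writer_weights, blocks=30000, seed=0):
    """Share of all replicas landing on each rack.
    writer_weights says how often each rack is the source of a write."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(blocks):
        writer = rng.choices(racks, weights=writer_weights)[0]
        counts.update(place_replicas(writer, racks, rng))
    total = sum(counts.values())
    return {r: counts[r] / total for r in racks}

if __name__ == "__main__":
    two_racks = ["/rack1", "/rack2"]  # hypothetical rack names
    print("writers spread evenly:", simulate(two_racks, [1, 1]))
    print("all writes from rack1:", simulate(two_racks, [1, 0]))
```

Under these assumptions the imbalance only shows up when writes originate mostly from one rack: with writers spread evenly across two equal racks the shares come out close to half each, while with every write coming from one rack the other rack takes two thirds of the replicas.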
I have heard this referred to as the "Swiss Cheese" method of server allocation, because you just put in servers wherever there are holes.
But I hope you have a fast backbone - when Spark or Hadoop jobs that use map/reduce get to the shuffle phase, all hell breaks loose on the network. Then your net admins will beg you to use separate racks.
The HDFS balancer is rack aware and will take care that you don't run into this scenario. Make sure the balancer doesn't take a lot of bandwidth, or, instead of running the balancer in online mode, run it nightly when load is low.
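Coming back to the logical rack mapping asked about at the top of the thread: Hadoop resolves racks by calling the script configured under `net.topology.script.file.name` in core-site.xml, passing it one or more host names or IPs as arguments and reading one rack path per argument from stdout. A minimal sketch of such a script follows; the IP addresses and the `/logical-rack-a` / `/logical-rack-b` names are hypothetical placeholders, and how many physical racks to fold into each logical rack is left as an assumption for you to decide.

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping: servers scattered over many physical racks are
# grouped into two logical racks. Replace with your own inventory.
RACK_OF = {
    "10.0.1.11": "/logical-rack-a",
    "10.0.4.27": "/logical-rack-a",
    "10.0.7.15": "/logical-rack-b",
    "10.0.9.33": "/logical-rack-b",
}

# Rack reported for any host that is missing from the mapping.
DEFAULT_RACK = "/default-rack"

def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_OF.get(h, DEFAULT_RACK) for h in hosts]

if __name__ == "__main__":
    # Hadoop may pass several hosts in a single call; answer all of them.
    print(" ".join(resolve(sys.argv[1:])))
```

Keep in mind that HDFS treats each logical rack as a failure domain, so servers that share a physical rack should generally not be split in a way that lets all replicas of a block end up behind the same switch or power feed.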