We've added new DataNodes to the cluster. These new DataNodes have bigger disks than the old ones. The balancer is running flawlessly for days now but my concern is that because the balancing is using percentage for spreading blocks - the end result would be that most of the data would be located in the new DNs and a way smaller amount in the old DNs. (For example - old DNs are 20TB, new DNs are 50 TB - so end result would be 2 TB in old DNs and 5 TB in new if threshold is 10%).
1. Doesn't this division of data misses the parallel computing advantage ? (most of the data is centralized in few DNs) 2. In case a new DN is down for any reason the recovery process (fixing under replicated blocks) would take longer, no ?
Wouldn't it be smart to stop the balancer when data is spread evenly in size and not in percentage ?
Balancer works on maintaining threshold percentage, by default its 10% so it doesn't matter if new nodes are of higher capacity then older.
1. Doesn't this division of data misses the parallel computing advantage ? (most of the data is centralized in few DNs)
If your cluster is un-balanced then yes it will affect your MapReduce job performance because of jobs getting scheduled on the Datanodes with larger dataset.
2. In case a new DN is down for any reason the recovery process (fixing under replicated blocks) would take longer, no ?
Yes thats correct
3. Wouldn't it be smart to stop the balancer when data is spread evenly in size and not in percentage ?
You can stop balancer at any time, it's safe to stop it by pressing ctrl+c command.
Since you added new high capacity nodes into the cluster therefore first it would be recommended to run HDFS balance multiple times specially when your cluster is less load or any idle window(for best output) and we should start balancer threshold value from a high number to lower value, Example start with 10 and end with 5. Though it is not guaranteed that you will get fully balanced cluster as per threshold in single shot but balancer will try it best during multiple runs. I think it tries to balance 10G of data at once. Also namenode has limits on how many blocks can be moved at one time, In addition to that you may also consider increasing below properties values to optimize balancer.
You don't need to stop the balancer. Many of our customers and some large installations of HDFS are continuously running balancer. In fact we have often talked about making balancer a part of namenode or a service within it. So I would not worry about stopping balancer, instead I would worry if it was not running at all.
Hi Adi, the threshold means that the utilization of storage on each node after balancing will be (ACU +- threshold) where I use ACU to denote "average cluster utilization". Example:
(1) Before adding new nodes: Let's say you have 10 data nodes, each has capacity of 20T, and your data size is 100T. In this case ACU=50% and if all nodes are perfectly balanced, each stores 10T of data.
(2) After: Let's say you add 4 large nodes, each with capacity of 50T, and you still have 100T of data. Your total capacity is now doubled to 400T, and therefore ACU=25%. However, your new nodes are empty. Running the balancer with threshold th=10% will ensure that utilization of all nodes is between ACU-th and ACU+th, in this case between 15% and 35%. We are starting with old nodes at 50% and new nodes at 0% of utilization. Balancer will keep on moving data until old nodes' utilization is <= 35% and new nodes' utilization is >= 15%, which means old nodes keeping less than 20*0.35=7T and new nodes keeping more than 50*0.15=7.5T. As you can see, in this particular case data-per-node amounts are not so far away from each other, but as you keep on adding more data the differences will grow up little by little.
I think in the large cluster, we need to scale-out datanode everyday, and add new datanode day by day, so we keep running balancer all time.