For the last couple days I have been working on tuning our cluster HDFS rebalance. The cluster had a large disparity in data due to not running the rebalance tool and needed to move ~150tb of data to rebalance.
I managed to tune the settings to move around 180gb every 5 minutes so it should be able to rebalance in about 3 days. The job failed over the weekend due to the kerberos ticket expiring so after restarting it the transfer rate is now around 60gb every 5mins. I restarted the balancer again and now it's around 25gb every 5mins. At this rate it will take over 2 weeks to balance the cluster.
I am running the balancer with debug INFO level and do not see any block failures during any iterations of the balancer. I am monitoring the disk and network io during the balancer runs and we are not anywhere near saturation of either.
Has anyone experienced behaviors with the balancer like this?