I'm running CDH5.12 on a site with 100+ nodes and about 10k regions. After an HBase restart, a single regionserver ends up with 4-5k regions, and the regions only get redistributed once the balancer kicks in, which can take a long time (~6-8 hours).
It seems that 'hbase.master.wait.on.regionservers.mintostart' and 'hbase.master.wait.on.regionservers.interval' are the settings that influence the master's behaviour here.
This led me to the following questions:
1. Is the "Graceful Shutdown Timeout" configuration default of 3 minutes large enough for larger sites? Or does it need to be increased to account for the incoming traffic on the regionservers?
2. Do I need to change the hbase.master.wait.on.* settings to compensate for this larger site? And if so, could this be a suggestion that CM itself gives, like it does for many other settings?
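
For reference, the kind of change I have in mind for question 2 would look roughly like the hbase-site.xml fragment below. This is only a sketch: the values are untested placeholders I picked for a ~100 node cluster, not recommendations, and I'm assuming it would go into the HBase safety valve for hbase-site.xml in CM.

```xml
<!-- Sketch only: placeholder values, not tested on this cluster. -->
<property>
  <!-- Number of regionservers the master waits for before it starts assigning regions.
       90 is an assumed value for a site with 100+ nodes. -->
  <name>hbase.master.wait.on.regionservers.mintostart</name>
  <value>90</value>
</property>
<property>
  <!-- Interval, in milliseconds, used by the master while waiting for regionservers
       to check in. 3000 is an assumed value. -->
  <name>hbase.master.wait.on.regionservers.interval</name>
  <value>3000</value>
</property>
```

The idea is that the master would hold off on region assignment until most regionservers have reported in, instead of handing the bulk of the regions to whichever regionserver registers first.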