I am managing a CDH 5.13 cluster with 4 datanodes. Each datanode has 10 x 2.7 TB disks (~90% used), and we just added another 8 x 3.6 TB disks to each node.
I ran a "Rebalance" from the HDFS service, which apparently did nothing, since all nodes already had the same total disk usage.
I then followed this post to run the intra-node disk balancer (with the threshold set to 25). After 1 hour of execution, progress is terribly slow (as you can see from /data/18, the only new disk receiving data):
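Concretely, the steps I ran follow the standard `hdfs diskbalancer` workflow (the hostname and plan path below are examples, not my actual values):

```shell
# Generate a plan for one datanode; a threshold of 25 means only disks
# whose utilization deviates from the node's average by more than 25%
# are considered for data moves.
hdfs diskbalancer -plan dn1.example.com -thresholdPercentage 25

# Execute the plan file that the -plan step wrote to HDFS
# (the exact path is printed by the previous command).
hdfs diskbalancer -execute /system/diskbalancer/<timestamp>/dn1.example.com.plan.json

# Check progress of the running plan on that datanode.
hdfs diskbalancer -query dn1.example.com
```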
$ sudo df -h
...
/dev/sdc1  2.8T  2.4T  361G  88%  /data/02
/dev/sdk1  2.8T  2.5T  315G  89%  /data/10
/dev/sdg1  2.8T  2.5T  308G  89%  /data/07
/dev/sdi1  2.8T  2.5T  314G  89%  /data/08
/dev/sdj1  2.8T  2.5T  300G  90%  /data/09
/dev/sde1  2.8T  2.5T  299G  90%  /data/04
/dev/sdf1  2.8T  2.5T  303G  90%  /data/06
/dev/sdh1  2.8T  2.4T  353G  88%  /data/05
/dev/sdb1  2.8T  806G  2.0T  29%  /data/01
/dev/sdd1  2.8T  2.5T  298G  90%  /data/03
---#NEW DISKS#---
/dev/sdl1  3.7T   35M  3.7T   1%  /data/11
/dev/sdm1  3.7T   36M  3.7T   1%  /data/12
/dev/sdn1  3.7T   34M  3.7T   1%  /data/13
/dev/sdo1  3.7T   35M  3.7T   1%  /data/14
/dev/sdp1  3.7T   34M  3.7T   1%  /data/15
/dev/sdq1  3.7T   34M  3.7T   1%  /data/16
/dev/sdr1  3.7T   34M  3.7T   1%  /data/17
/dev/sds1  3.7T   26G  3.7T   1%  /data/18
I would like to ask the following:
1. Currently, no pipelines are accessing HDFS, but tomorrow morning there will be, and it is obvious from the current progress that disk balancing will not have finished by then. Is it safe to leave this process running while the cluster is in production?
2. Is there something I can do to speed things up?
3. How can I terminate this process, safely, if required?
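Regarding questions 2 and 3, my current understanding from the Apache docs is that the copy rate is throttled by a datanode property and that a running plan can be cancelled; the commands below are what I believe applies (plan path, plan ID, and hostname are placeholders):

```shell
# Speed: the disk balancer throttles copies via
# dfs.disk.balancer.max.disk.throughputInMBPerSec (default 10 MB/s
# per the Apache documentation); raising it in hdfs-site.xml and
# restarting the datanodes should speed up the moves.

# Cancel: either by the plan file used with -execute...
hdfs diskbalancer -cancel /system/diskbalancer/<timestamp>/dn1.example.com.plan.json

# ...or by plan ID and node (the plan ID is shown by -query).
hdfs diskbalancer -cancel <planID> -node dn1.example.com
```

Is this the right way to tune and stop it, and is cancelling mid-plan safe for the data already moved?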
I am copying this from the Apache documentation:
"A plan can be executed against an operational data node. Disk balancer should not interfere with other processes since it throttles how much data is copied every second."
Does "should not interfere" mean "does not interfere" here, or does it mean that other processes should not run while the balancer is running?