Created 10-20-2016 07:36 AM
Hello,
I'm trying to rebalance HDFS in our HDP 2.4.3 cluster (which runs NameNode HA), but the balancer only does actual work for a short time and then sits idle. If I kill the process and restart it, it balances immediately for a while and then goes idle again. I have repeated this many times now.
I enabled debug logging for the balancer but I can't see anything in there that explains why it just stops balancing.
Here is the beginning of the log (since it shows some parameters that might be relevant):
16/10/19 16:34:10 INFO balancer.Balancer: namenodes = [hdfs://PROD1]
16/10/19 16:34:10 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
16/10/19 16:34:10 INFO balancer.Balancer: included nodes = []
16/10/19 16:34:10 INFO balancer.Balancer: excluded nodes = []
16/10/19 16:34:10 INFO balancer.Balancer: source nodes = []
16/10/19 16:34:11 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
16/10/19 16:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 16:34:11 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 500 (default=5)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
16/10/19 16:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728)
16/10/19 16:34:11 INFO net.NetworkTopology: Adding a new node: /default-rack/X.X.X.X:1019
....
16/10/19 16:34:11 INFO balancer.Balancer: Need to move 11.83 TB to make the cluster balanced.
...
16/10/19 16:34:11 INFO balancer.Balancer: Will move 120 GB in this iteration
16/10/19 16:34:11 INFO balancer.Dispatcher: Start moving blk_1661084121_587506756 with size=72776669 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
...
16/10/19 16:34:12 WARN balancer.Dispatcher: No mover threads available: skip moving blk_1457593679_384005217 with size=104909643 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
...
Here is the part of the log just after the last block has successfully been moved:
...
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693419961_619844350 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693366190_619790579 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 19:04:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 21:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/20 00:04:11 INFO block.BlockTokenSecretManager: Setting block keys
...
In the log sections above I'm not showing the debug output, since it is pretty verbose and, from what I can see, the only thing it mentions is a periodic reauthentication of the ipc.Client.
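For anyone digging through a similar log, a quick tally of the Dispatcher messages can show how many moves were started, how many completed, how many were skipped for lack of mover threads, and when the last move finished. This is only a rough sketch: the function name `summarize_balancer_log` and the temporary sample file are made up for illustration, and the matched strings are taken from the log lines quoted above.

```shell
# Hypothetical helper: tally Dispatcher events in a balancer log file.
summarize_balancer_log() {
  log="$1"
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true" ignores that.
  started=$(grep -cF 'balancer.Dispatcher: Start moving' "$log" || true)
  moved=$(grep -cF 'balancer.Dispatcher: Successfully moved' "$log" || true)
  skipped=$(grep -cF 'No mover threads available' "$log" || true)
  echo "started=$started moved=$moved skipped=$skipped"
  # Date and time (first two fields) of the last successful move, if any.
  grep -F 'Successfully moved' "$log" | tail -n 1 | awk '{print "last_move=" $1 " " $2}'
}

# Demo on three lines copied from the log excerpts in this post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
16/10/19 16:34:11 INFO balancer.Dispatcher: Start moving blk_1661084121_587506756 with size=72776669 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 16:34:12 WARN balancer.Dispatcher: No mover threads available: skip moving blk_1457593679_384005217 with size=104909643 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693419961_619844350 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
EOF
summarize_balancer_log "$sample"
rm -f "$sample"
```

Run against the real balancer log, a large gap between the started and moved counts, or a last_move time long before the end of the log, would match the idle behaviour described here.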
I'm launching the balancer from command line using the following command:
$ hdfs --loglevel DEBUG balancer -D dfs.datanode.balance.bandwidthPerSec=200000000
I have tried other values of the bandwidth setting but it doesn't change the behaviour.
Can anyone see if I'm doing something wrong and point me towards a solution?
Best Regards
/Thomas
Created 10-25-2016 07:02 AM
We got it to work by lowering the "dfs.datanode.balance.max.concurrent.moves" from 500 to 20, which is more in line with the guide at https://community.hortonworks.com/articles/43849/hdfs-balancer-2-configurations-cli-options.html.
It's possible that we could also have gotten it to work by upping the dispatcher threads setting suggested by aengineer below but we didn't try that once we got this to work.
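For reference, one way to pass the lower value on the command line is sketched below. It is an assumption that the override is given with -D (as in the original command in the question) rather than changed in hdfs-site.xml; the post does not say which was used. The helper name `balancer_cmd` is made up, and the DataNodes keep their own copy of this setting, which a client-side -D does not change.

```shell
# Hypothetical helper: print the balancer command with a lower
# dfs.datanode.balance.max.concurrent.moves (20 is the value that worked here).
balancer_cmd() {
  moves="${1:-20}"
  printf 'hdfs balancer -D dfs.datanode.balance.max.concurrent.moves=%s\n' "$moves"
}

# Print the command for review; nothing is executed against the cluster.
balancer_cmd 20
```

Printing the command first makes it easy to check the value before running it by hand on a node with the HDFS client installed.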
Created 10-20-2016 06:26 PM
The issue is due to this line
Please increase that to a large value (I don't know the size of your cluster or the DataNode configuration).
Something like this is what I would do:
-Ddfs.balancer.moverThreads=10000 -Ddfs.balancer.dispatcherThreads=10000
Thanks
Anu
Created 10-25-2016 07:03 AM
Thanks for your reply, Anu. We didn't get around to trying your suggestion, so unfortunately I can't accept your answer, even though it might be valid.