HDFS balancer stops balancing without feedback and just seems to idle

Rising Star

Hello,

I'm trying to rebalance HDFS in our HDP 2.4.3 cluster (which is running NameNode HA), and the problem is that the balancer only does actual work for a short time and then just sits idle. If I kill the process and restart it, it does some balancing immediately and then goes idle again. I have repeated this many times now.

I enabled debug logging for the balancer but I can't see anything in there that explains why it just stops balancing.

Here is the beginning of the log (since it shows some parameters that might be relevant):

16/10/19 16:34:10 INFO balancer.Balancer: namenodes  = [hdfs://PROD1]
16/10/19 16:34:10 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
16/10/19 16:34:10 INFO balancer.Balancer: included nodes = []
16/10/19 16:34:10 INFO balancer.Balancer: excluded nodes = []
16/10/19 16:34:10 INFO balancer.Balancer: source nodes = []
16/10/19 16:34:11 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
16/10/19 16:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 16:34:11 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 500 (default=5)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
16/10/19 16:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728)
16/10/19 16:34:11 INFO net.NetworkTopology: Adding a new node: /default-rack/X.X.X.X:1019
....
16/10/19 16:34:11 INFO balancer.Balancer: Need to move 11.83 TB to make the cluster balanced.
...
16/10/19 16:34:11 INFO balancer.Balancer: Will move 120 GB in this iteration
16/10/19 16:34:11 INFO balancer.Dispatcher: Start moving blk_1661084121_587506756 with size=72776669 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
...
16/10/19 16:34:12 WARN balancer.Dispatcher: No mover threads available: skip moving blk_1457593679_384005217 with size=104909643 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
...

Here is the part of the log just after the last block has successfully been moved:

...
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693419961_619844350 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693366190_619790579 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 19:04:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 21:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/20 00:04:11 INFO block.BlockTokenSecretManager: Setting block keys
...

In the log sections above I'm not showing the debug output, since it is pretty verbose and, from what I can see, the only thing it mentions is a periodic reauthentication of the ipc.Client.
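One way to see where the balancer stops doing work, without reading through the DEBUG output, is to count moved and skipped blocks in a saved copy of the log. The sketch below is illustrative only: it assumes the balancer output was captured to a file, and it builds a small sample file from log lines quoted in this post so that it can be run as-is. The file path and the variable names (`LOG_FILE`, `moved`, `skipped`, `last_move`) are hypothetical.

```shell
# Illustrative only: a tiny sample built from log lines quoted in this post.
# In practice, point LOG_FILE at your own captured balancer output instead.
LOG_FILE="$(mktemp)"
cat > "$LOG_FILE" <<'EOF'
16/10/19 16:34:11 INFO balancer.Dispatcher: Start moving blk_1661084121_587506756 with size=72776669 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 16:34:12 WARN balancer.Dispatcher: No mover threads available: skip moving blk_1457593679_384005217 with size=104909643 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693419961_619844350 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693366190_619790579 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 19:04:11 INFO block.BlockTokenSecretManager: Setting block keys
EOF

# Count completed moves and moves skipped for lack of mover threads.
moved=$(grep -c 'Successfully moved' "$LOG_FILE")
skipped=$(grep -c 'No mover threads available' "$LOG_FILE")

# Date and time (first two fields) of the last completed move.
last_move=$(grep 'Successfully moved' "$LOG_FILE" | tail -n 1 | cut -d' ' -f1,2)

echo "moved=$moved skipped=$skipped last_move=$last_move"
rm -f "$LOG_FILE"
```

Comparing the time of the last completed move with the later "Setting block keys" lines shows how long the process has been idle.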

I'm launching the balancer from the command line using the following command:

$ hdfs --loglevel DEBUG balancer -D dfs.datanode.balance.bandwidthPerSec=200000000 

I have tried other values of the bandwidth setting but it doesn't change the behaviour.

Can anyone see if I'm doing something wrong and point me towards a solution?

Best Regards

/Thomas

3 REPLIES

Expert Contributor
From the logs, it looks like the issue is that you are running out of dispatcher threads on the balancer:

16/10/19 16:34:12 WARN balancer.Dispatcher: No mover threads available: skip moving blk_1457593679_384005217 with size=104909643 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019

The issue is due to this line:

16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)

Please increase that to a large value (I don't know the size of your cluster or your DataNode configuration).

Something like this is what I would do:

 -Ddfs.balancer.moverThreads=10000 -Ddfs.balancer.dispatcherThreads=10000
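The flags above are only a fragment of a command line. As a minimal sketch, assuming they are passed to the same `hdfs balancer` invocation used in the question, the full command could be assembled and printed as a dry run before running it for real. The variable names (`MOVER_THREADS`, `DISPATCHER_THREADS`, `BALANCER_CMD`) are hypothetical, and the values are simply the ones suggested here, not tuned recommendations.

```shell
# Dry run: build the balancer command with the suggested overrides and print it.
# Assumption: the values below are the ones suggested in this reply; adjust
# them to the size of your cluster.
MOVER_THREADS=10000
DISPATCHER_THREADS=10000

BALANCER_CMD="hdfs balancer -Ddfs.balancer.moverThreads=${MOVER_THREADS} -Ddfs.balancer.dispatcherThreads=${DISPATCHER_THREADS}"

# Print instead of executing, so the line can be checked first.
echo "$BALANCER_CMD"
```

Once the printed line looks right, run it directly on a host with the HDFS client installed.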

Thanks

Anu

Rising Star

Thanks for your reply, Anu. We didn't get around to trying your suggestion, so unfortunately I can't accept your answer, even though it might be valid.
