Question on HDFS Rebalance

Expert Contributor

I am currently running HDFS rebalance in my prod environment and it has been running for more than 2 hours.

How can I find out what percentage of this job is complete?

I need to get an ETA on this one.

1 ACCEPTED SOLUTION

Guru

Ambari

Open Background Operations (the button labelled "Ops") in the upper left of Ambari and you will see the rebalance progress. Double-clicking the progress bar repeatedly drills down into more detail, including how many blocks are being rebalanced.

If you go to [HDFS > QuickLinks > Namenode UI > Live Nodes link in body of page] you will see the HDFS capacity used on each node (and thus the imbalance). You can use this to estimate how long the rebalance will take. If you used the default of 10 for Balance Threshold, it will stop rebalancing once every DataNode's utilization is within 10 percentage points of the overall cluster utilization.
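To make the threshold check concrete, here is a minimal sketch that flags which nodes are still outside the threshold. It assumes the balancer's documented criterion (each DataNode's utilization compared with overall cluster utilization). The function name check_balance and the three-column input layout are hypothetical; the numbers would be copied by hand from the Live Nodes page.

```shell
#!/bin/sh
# Sketch: flag DataNodes whose utilization is outside the balancer threshold.
# Input (stdin): one line per node, "<host> <dfs_used> <capacity>", with
# used and capacity in the same unit. This layout is hypothetical.
# Usage: check_balance 10 < nodes.txt
check_balance() {
  awk -v threshold="$1" '
    NF >= 3 && $3 > 0 {
      n++
      host[n] = $1; used[n] = $2; cap[n] = $3
      total_used += $2; total_cap += $3
    }
    END {
      if (total_cap <= 0) { print "no capacity data"; exit 1 }
      # Cluster utilization is total used over total capacity.
      cluster = 100 * total_used / total_cap
      for (i = 1; i <= n; i++) {
        util = 100 * used[i] / cap[i]
        diff = util - cluster
        status = "within"
        if (diff > threshold) status = "over"
        else if (-diff > threshold) status = "under"
        printf "%s used=%.1f%% cluster=%.1f%% status=%s\n", host[i], util, cluster, status
      }
    }
  '
}
```

Nodes reported as "over" or "under" are the ones the balancer still has to move data away from or onto, so the amount of data on those nodes gives a rough feel for the remaining work.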

If you think it will take too long to rebalance, you can kill the job: go to the namenode command line, run ps -aef | grep balancer to find the balancer's process ID, then run kill -9 <PID>. Then set the threshold higher, e.g. 25 (%), and run it again; it will finish sooner because less data has to move. Next, you can rebalance again at a 20% threshold, then at 15%, and so on. This gives you greater control over the timing and duration of each rebalance.
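The staged approach could be scripted roughly as follows. This is only a sketch: hdfs balancer -threshold is the standard CLI form, but the function name staged_balance and the DRY_RUN switch are hypothetical, and stopping at the first non-zero exit status is an assumption about how you would want failures handled.

```shell
#!/bin/sh
# Sketch: run the balancer in stages, from a loose threshold to a tight one.
# DRY_RUN (hypothetical switch) defaults to 1, which only prints the commands.
# Usage: DRY_RUN=0; staged_balance 25 20 15 10
staged_balance() {
  for t in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "hdfs balancer -threshold $t"
    else
      # Stop at the first stage that fails or is interrupted.
      hdfs balancer -threshold "$t" || return 1
    fi
  done
}
```

Each stage ends when the cluster meets that stage's threshold, so you can stop between stages without killing a run midway.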

CLI

If running from the command line, you should see the progress of each iteration in stdout. To estimate the time to balance, use the same technique as above (go to the Namenode UI and estimate the remaining time from the imbalance and the amount of block data being moved). To kill and restart at a higher balancer threshold, just press Ctrl+C and run again.

2 REPLIES

Contributor

I also noticed you can monitor the "Need to move" message in the balancer log to see how much data remains to be balanced. The figure can go up or down depending on how busy the cluster is:

grep "Need to move" /tmp/hdfs_rebalancer.log | tail -n 10
19/01/28 12:23:02 INFO balancer.Balancer: Need to move 11.11 TB to make the cluster balanced.
19/01/28 12:43:48 INFO balancer.Balancer: Need to move 11.10 TB to make the cluster balanced.
19/01/28 13:04:38 INFO balancer.Balancer: Need to move 10.89 TB to make the cluster balanced.
19/01/28 13:25:23 INFO balancer.Balancer: Need to move 10.83 TB to make the cluster balanced.
19/01/28 13:45:59 INFO balancer.Balancer: Need to move 10.83 TB to make the cluster balanced.
19/01/28 14:06:30 INFO balancer.Balancer: Need to move 10.78 TB to make the cluster balanced.
19/01/28 14:27:14 INFO balancer.Balancer: Need to move 10.73 TB to make the cluster balanced.
19/01/28 14:47:53 INFO balancer.Balancer: Need to move 10.70 TB to make the cluster balanced.
19/01/28 15:08:42 INFO balancer.Balancer: Need to move 10.66 TB to make the cluster balanced.
19/01/28 15:29:23 INFO balancer.Balancer: Need to move 10.75 TB to make the cluster balanced.
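
Since the original question was about an ETA, the log lines above can be turned into a rough estimate by comparing the first and last samples. A minimal sketch, assuming the log lines look exactly like the sample (yy/MM/dd HH:mm:ss timestamps, size and unit in the 8th and 9th fields) and that units are 1024-based; the function name balancer_eta is hypothetical.

```shell
#!/bin/sh
# Sketch: rough ETA from the first and last "Need to move" log lines.
# Assumes lines shaped like the sample above:
#   yy/MM/dd HH:mm:ss INFO balancer.Balancer: Need to move <n> <unit> to ...
# Usage: balancer_eta /tmp/hdfs_rebalancer.log
balancer_eta() {
  grep "Need to move" "$1" | awk '
    # Day count for a 20yy date; only differences between days are used.
    function day_number(y, m, d,    leap, month_days) {
      y += 2000
      if (m <= 2) { y -= 1; m += 12 }
      leap = int(y / 4) - int(y / 100) + int(y / 400)
      month_days = int((153 * (m - 3) + 2) / 5)
      return 365 * y + leap + month_days + d
    }
    # Convert a size to TB, assuming binary (1024-based) units.
    function to_tb(value, unit) {
      if (unit == "PB") return value * 1024
      if (unit == "TB") return value
      if (unit == "GB") return value / 1024
      if (unit == "MB") return value / 1024 / 1024
      return -1   # unrecognised unit
    }
    {
      split($1, dt, "/"); split($2, tm, ":")
      t = day_number(dt[1] + 0, dt[2] + 0, dt[3] + 0) * 86400
      t += tm[1] * 3600 + tm[2] * 60 + tm[3]
      tb = to_tb($8 + 0, $9)
      if (tb < 0) next
      if (++n == 1) { t0 = t; tb0 = tb }
      t1 = t; tb1 = tb
    }
    END {
      if (n < 2 || t1 <= t0 || tb1 >= tb0) {
        print "not enough net progress to estimate"
        exit
      }
      rate = (tb0 - tb1) / (t1 - t0)   # TB per second
      printf "moved %.2f TB in %.1f h; about %.1f h remaining\n",
             tb0 - tb1, (t1 - t0) / 3600, tb1 / rate / 3600
    }
  '
}
```

On the ten sample lines above (0.36 TB moved in about 3.1 hours, 10.75 TB left) this works out to roughly 93 hours. Treat it as a ballpark only: as the last sample shows, the remaining figure can rise as well as fall while the cluster is busy.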