Created on 12-20-2017 09:00 PM - edited 08-17-2019 05:41 PM
Hi,
I have a problem with rebalancing HDFS after adding a new DataNode to the cluster. My configuration had 4 DataNodes, and I added a new one (the 5th).
Below is the report from dfsadmin:
[hdfs@snr-prod-master0 ~]$ hdfs dfsadmin -report
Configured Capacity: 21563228579840 (19.61 TB)
Present Capacity: 20460562895805 (18.61 TB)
DFS Remaining: 20290148094909 (18.45 TB)
DFS Used: 170414800896 (158.71 GB)
DFS Used%: 0.83%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (5):

Name: 172.17.2.61:50010 (snr-prod-slave1)
Hostname: snr-prod-slave1
Decommission Status : Normal
Configured Capacity: 4312645715968 (3.92 TB)
DFS Used: 35358969856 (32.93 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 4056646234773 (3.69 TB)
DFS Used%: 0.82%
DFS Remaining%: 94.06%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 12
Last contact: Wed Dec 20 20:52:23 UTC 2017

Name: 172.17.2.64:50010 (snr-prod-slave4)
Hostname: snr-prod-slave4
Decommission Status : Normal
Configured Capacity: 4312645715968 (3.92 TB)
DFS Used: 47864344576 (44.58 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 4044275077691 (3.68 TB)
DFS Used%: 1.11%
DFS Remaining%: 93.78%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Dec 20 20:52:26 UTC 2017

Name: 172.17.2.62:50010 (snr-prod-slave2)
Hostname: snr-prod-slave2
Decommission Status : Normal
Configured Capacity: 4312645715968 (3.92 TB)
DFS Used: 221184 (216 KB)
Non DFS Used: 0 (0 B)
DFS Remaining: 4092407638196 (3.72 TB)
DFS Used%: 0.00%
DFS Remaining%: 94.89%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 6
Last contact: Wed Dec 20 20:52:26 UTC 2017

Name: 172.17.2.65:50010 (snr-prod-slave5)
Hostname: snr-prod-slave5
Decommission Status : Normal
Configured Capacity: 4312645715968 (3.92 TB)
DFS Used: 44406976512 (41.36 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 4047866664447 (3.68 TB)
DFS Used%: 1.03%
DFS Remaining%: 93.86%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 8
Last contact: Wed Dec 20 20:52:23 UTC 2017

Name: 172.17.2.60:50010 (snr-prod-slave0)
Hostname: snr-prod-slave0
Decommission Status : Normal
Configured Capacity: 4312645715968 (3.92 TB)
DFS Used: 42784288768 (39.85 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 4048952479802 (3.68 TB)
DFS Used%: 0.99%
DFS Remaining%: 93.89%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 16
Last contact: Wed Dec 20 20:52:23 UTC 2017
After adding the new node to the cluster, I ran a rebalance operation to distribute the data equally, but the balancer says the cluster is already balanced (The cluster is balanced. Exiting...):
[hdfs@snr-prod-master0 ~]$ hdfs balancer -threshold 5
17/12/20 20:57:36 INFO balancer.Balancer: Using a threshold of 5.0
17/12/20 20:57:36 INFO balancer.Balancer: namenodes = [hdfs://snr-prod-master0:8020]
17/12/20 20:57:36 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 5.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
17/12/20 20:57:36 INFO balancer.Balancer: included nodes = []
17/12/20 20:57:36 INFO balancer.Balancer: excluded nodes = []
17/12/20 20:57:36 INFO balancer.Balancer: source nodes = []
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
17/12/20 20:57:37 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/12/20 20:57:38 INFO block.BlockTokenSecretManager: Setting block keys
17/12/20 20:57:38 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
17/12/20 20:57:38 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
17/12/20 20:57:38 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000)
17/12/20 20:57:38 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)
17/12/20 20:57:38 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 5 (default=5)
17/12/20 20:57:38 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648)
17/12/20 20:57:38 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
17/12/20 20:57:38 INFO block.BlockTokenSecretManager: Setting block keys
17/12/20 20:57:38 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
17/12/20 20:57:38 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728)
17/12/20 20:57:38 INFO net.NetworkTopology: Adding a new node: /default-rack/172.17.2.61:50010
17/12/20 20:57:38 INFO net.NetworkTopology: Adding a new node: /default-rack/172.17.2.60:50010
17/12/20 20:57:38 INFO net.NetworkTopology: Adding a new node: /default-rack/172.17.2.64:50010
17/12/20 20:57:38 INFO net.NetworkTopology: Adding a new node: /default-rack/172.17.2.62:50010
17/12/20 20:57:38 INFO net.NetworkTopology: Adding a new node: /default-rack/172.17.2.65:50010
17/12/20 20:57:38 INFO balancer.Balancer: 0 over-utilized: []
17/12/20 20:57:38 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
Dec 20, 2017 8:57:38 PM 0 0 B 0 B 0 B
Dec 20, 2017 8:57:38 PM Balancing took 1.714 seconds
What am I missing?
Thanks for any reply!
Robert
Created 12-20-2017 10:58 PM
Hi @Robert Jonczy,
The report you got is accurate. I would like to draw your attention to the "threshold" parameter you used:
-threshold <threshold> Percentage of disk capacity.
This value is how far, in percentage points above or below the cluster's average DFS usage, a DataNode's own DFS usage may be before the balancer moves data,
where average DFS usage = DFS Used / total capacity.
In your scenario the average usage is below 1%, and every DataNode is within about 1 percentage point of it. The 5% threshold you specified only triggers a move when a node is more than 5 points above or below the average (a band 10 points wide), which is not your case, so the balancer has nothing to move.
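To make the arithmetic concrete, here is a minimal Python sketch of that check, using the DFS Used and Configured Capacity figures from the dfsadmin report in the question. The names `NODES`, `utilization`, and `classify` are made up for illustration, and the logic is a simplification of the rule described above, not the balancer's actual code.

```python
# (DFS Used, Configured Capacity) in bytes, copied from the dfsadmin report.
NODES = {
    "snr-prod-slave1": (35358969856, 4312645715968),
    "snr-prod-slave4": (47864344576, 4312645715968),
    "snr-prod-slave2": (221184, 4312645715968),
    "snr-prod-slave5": (44406976512, 4312645715968),
    "snr-prod-slave0": (42784288768, 4312645715968),
}


def utilization(used, capacity):
    """DFS usage as a percentage of capacity."""
    return 100.0 * used / capacity


def classify(nodes, threshold):
    """Return (average, over-utilized, under-utilized) for a threshold.

    Assumption: a node only counts as over- or under-utilized when its
    usage is more than `threshold` percentage points away from the
    cluster-wide average usage.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    average = utilization(total_used, total_capacity)
    over = [name for name, (used, cap) in nodes.items()
            if utilization(used, cap) > average + threshold]
    under = [name for name, (used, cap) in nodes.items()
             if utilization(used, cap) < average - threshold]
    return average, over, under


if __name__ == "__main__":
    average, over, under = classify(NODES, threshold=5.0)
    print(f"average DFS usage: {average:.2f}%")
    print("over-utilized:", over)
    print("under-utilized:", under)
```

With a threshold of 5 both lists come out empty, which matches the "0 over-utilized" and "0 underutilized" lines in the balancer log. The sketch divides by configured capacity and gets an average of about 0.79%; the 0.83% in the report matches DFS Used divided by Present Capacity instead. Either way, the empty new node (snr-prod-slave2) is less than 1 percentage point below the average.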
Hope this clarifies things!
Created 12-20-2017 10:51 PM
Today the HDFS balancer doesn't balance disks within a DataNode.
This is a well-known and much-discussed issue with the HDFS balancer.
See the Apache JIRA: https://issues.apache.org/jira/browse/HDFS-1312
The Apache documentation is at: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html
That JIRA tracks and resolves this issue.
Created 12-21-2017 08:52 AM
@bkosaraju, your explanation makes sense. Thanks for clarifying! My understanding of the threshold was different.
Robert