Within the blog Rebalance your mixed HDFS & Kudu Services, we demonstrated how to properly review and set up a mixed HDFS / Kudu shared services cluster.
Now it is time to review a method that lets you confirm how the data of your HDFS & Kudu services is distributed at the disk level of each and every worker node.
Log in as root to each worker node that is part of the HDFS and Kudu services, and perform the following commands.
Check overall disk capacity status:
[root@<Worker-Node> ~]# df -h /data/* | sed 1d | sort
/dev/sdb  1.9T  1.4T  478G  75%  /data/1
/dev/sdc  1.9T  1.3T  560G  70%  /data/2
/dev/sdd  1.9T  1.4T  513G  73%  /data/3
/dev/sde  1.9T  1.4T  489G  74%  /data/4
/dev/sdf  1.9T  1.4T  464G  76%  /data/5
/dev/sdg  1.9T  1.4T  513G  73%  /data/6
/dev/sdh  1.9T  1.4T  525G  72%  /data/7
/dev/sdi  1.9T  1.4T  466G  76%  /data/8
/dev/sdj  1.9T  1.3T  538G  72%  /data/9
/dev/sdk  1.9T  1.5T  418G  78%  /data/10
/dev/sdl  1.9T  1.3T  617G  67%  /data/11
/dev/sdm  1.9T  1.3T  572G  70%  /data/12
/dev/sdn  1.9T  1.4T  474G  75%  /data/13
/dev/sdo  1.9T  1.3T  534G  72%  /data/14
/dev/sdp  1.9T  1.4T  468G  75%  /data/15
/dev/sdq  1.9T  1.4T  470G  75%  /data/16
/dev/sdr  1.9T  1.4T  466G  75%  /data/17
/dev/sds  1.9T  1.4T  468G  75%  /data/18
/dev/sdt  1.9T  1.4T  473G  75%  /data/19
/dev/sdu  1.9T  1.4T  474G  75%  /data/20
/dev/sdv  1.9T  1.4T  467G  75%  /data/21
/dev/sdw  1.9T  1.4T  474G  75%  /data/22
/dev/sdx  1.9T  1.4T  473G  75%  /data/23
/dev/sdy  1.9T  1.4T  477G  75%  /data/24
Check overall HDFS disk capacity status:
[root@<Worker-Node> ~]# du -h --max-depth=0 /data/*/dfs | sort -t/ -k3,3n
606G    /data/1/dfs
612G    /data/2/dfs
608G    /data/3/dfs
609G    /data/4/dfs
610G    /data/5/dfs
619G    /data/6/dfs
613G    /data/7/dfs
634G    /data/8/dfs
590G    /data/9/dfs
681G    /data/10/dfs
618G    /data/11/dfs
621G    /data/12/dfs
1.2T    /data/13/dfs
1.1T    /data/14/dfs
1.2T    /data/15/dfs
1.2T    /data/16/dfs
1.2T    /data/17/dfs
1.2T    /data/18/dfs
1.2T    /data/19/dfs
1.2T    /data/20/dfs
1.2T    /data/21/dfs
1.2T    /data/22/dfs
1.2T    /data/23/dfs
1.2T    /data/24/dfs
Check overall Kudu disk capacity status:
[root@<Worker-Node> ~]# du -h --max-depth=0 /data/*/kudu | sort -t/ -k3,3n
745G    /data/1/kudu
691G    /data/2/kudu
741G    /data/3/kudu
765G    /data/4/kudu
788G    /data/5/kudu
730G    /data/6/kudu
725G    /data/7/kudu
763G    /data/8/kudu
734G    /data/9/kudu
768G    /data/10/kudu
628G    /data/11/kudu
669G    /data/12/kudu
205G    /data/13/kudu
204G    /data/14/kudu
205G    /data/15/kudu
208G    /data/16/kudu
209G    /data/17/kudu
205G    /data/18/kudu
204G    /data/19/kudu
204G    /data/20/kudu
206G    /data/21/kudu
203G    /data/22/kudu
194G    /data/23/kudu
200G    /data/24/kudu
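If you have a large number of worker nodes, you may prefer to collect these figures from a single utility host rather than logging in to every node by hand. The following is only a minimal sketch, not part of the original procedure: it assumes passwordless SSH as root and a hypothetical worker_nodes.txt file listing one worker hostname per line, and it uses the same /data mount layout shown above, so adjust it to match your own cluster.

#!/bin/bash
# collect_disk_usage.sh - run the three capacity checks on every worker node over SSH.
# Assumptions: passwordless SSH as root and a worker_nodes.txt file (one hostname per line).
while read -r node; do
    echo "===== ${node} ====="
    # -n stops ssh from consuming the remaining hostnames on stdin.
    ssh -n "root@${node}" '
        echo "--- df -h ---";   df -h /data/* | sed 1d | sort
        echo "--- du dfs ---";  du -h --max-depth=0 /data/*/dfs  | sort -t/ -k3,3n
        echo "--- du kudu ---"; du -h --max-depth=0 /data/*/kudu | sort -t/ -k3,3n
    '
done < worker_nodes.txt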
Now collate all of the information retrieved from the worker nodes into an easy-to-read format so that out-of-sync characteristics at the Worker Node layer are simple to observe.
Taking the output from the commands used to check the balance of HDFS & Kudu, here is an example of how you might collate the information so that data balance issues at the Worker Node level are easy to notice.
Worker Node Balance Example:

| Disk | df -h (Size Used Avail Use%) | du -h (dfs) | du -h (kudu) |
| --- | --- | --- | --- |
| /data/1 | 1.9T 1.7T 164G 92% | 414G | 1.3T |
| /data/2 | 1.9T 1.5T 395G 79% | 499G | 970G |
| /data/3 | 1.9T 1.5T 351G 82% | 487G | 1022G |
| /data/4 | 1.9T 1.5T 338G 82% | 493G | 1.1T |
| /data/5 | 1.9T 1.5T 352G 82% | 486G | 1.1T |
| /data/6 | 1.9T 1.5T 337G 82% | 498G | 1.1T |
| /data/7 | 1.9T 1.5T 337G 82% | 485G | 1.1T |
| /data/8 | 1.9T 1.5T 350G 82% | 494G | 1018G |
| /data/9 | 1.9T 1.5T 339G 82% | 475G | 1.1T |
| /data/10 | 1.9T 1.5T 391G 80% | 487G | 985G |
| /data/11 | 1.9T 1.5T 338G 82% | 487G | 1.1T |
| /data/12 | 1.9T 1.6T 320G 83% | 475G | 1.1T |
| /data/13 | 1.9T 1.2T 688G 64% | 1.2T | 353M |
| /data/14 | 1.9T 1.2T 679G 64% | 1.2T | 8.5G |
| /data/15 | 1.9T 1.2T 674G 64% | 1.2T | 13G |
| /data/16 | 1.9T 1.2T 678G 64% | 1.2T | 8.0G |
| /data/17 | 1.9T 1.2T 686G 64% | 1.2T | 8.0K |
| /data/18 | 1.9T 1.2T 680G 64% | 1.2T | 5.4G |
| /data/19 | 1.9T 1.2T 694G 63% | 1.2T | 33M |
| /data/20 | 1.9T 1.2T 688G 64% | 1.2T | 8.0K |
| /data/21 | 1.9T 1.2T 689G 64% | 1.2T | 8.0K |
| /data/22 | 1.9T 1.2T 686G 64% | 1.2T | 129M |
| /data/23 | 1.9T 1.2T 679G 64% | 1.2T | 7.4G |
| /data/24 | 1.9T 1.2T 684G 64% | 1.2T | 33M |
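The table above was assembled by hand, but a small script along the following lines can produce a similar per-disk summary automatically. This is only a sketch, assuming the same /data/<N> mount layout with dfs and kudu subdirectories used throughout this blog; verify the df column positions on your own systems before relying on it.

#!/bin/bash
# disk_balance_report.sh - print one line per data disk: df figures plus dfs and kudu totals.
# Assumes the /data/<N> mounts and dfs/kudu subdirectories shown earlier in this blog.
printf "%-10s  %-26s  %8s  %8s\n" "Disk" "Size Used Avail Use%" "dfs" "kudu"
for mount in /data/*; do
    # Columns 2-5 of df -h are Size, Used, Avail, Use% on standard Linux coreutils.
    dfline=$(df -h "${mount}" | sed 1d | awk '{print $2, $3, $4, $5}')
    dfs_used=$(du -sh "${mount}/dfs"  2>/dev/null | awk '{print $1}')
    kudu_used=$(du -sh "${mount}/kudu" 2>/dev/null | awk '{print $1}')
    printf "%-10s  %-26s  %8s  %8s\n" "${mount}" "${dfline}" "${dfs_used:-n/a}" "${kudu_used:-n/a}"
done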
If you aligned the mixed HDFS & Kudu configuration some time after the cluster was originally deployed, you are likely to encounter node-level disk capacity issues when you run the Kudu rebalance command.
This is because the Kudu rebalancer is currently unaware of total disk capacity and of currently used disk capacity; it balances tablet replica counts across tablet servers, not bytes on disk.
In the Worker Node Balance Example above, we can see 24 disks, and every one of them highlights a fundamental imbalance.
All 24 disks in the example are configured for both HDFS and Kudu, but the alignment of the HDFS & Kudu configuration happened only after the cluster had been in use for many years.
Note how out of sync they are: /data/1 through /data/12 each hold roughly 1 TB of Kudu data and only around 420-500 GB of HDFS data, while /data/13 through /data/24 each hold about 1.2 TB of HDFS data and almost no Kudu data at all.
There are other scenarios that can cause a similar imbalance, for example failed disks being replaced while the HDFS and Kudu rebalancing activities remain focused only at the service level.
Whether the cause is a later adoption or alignment of the HDFS or Kudu disk configuration, or simply a cluster that has had countless disks replaced over time with no local disk balancing applied afterward, it is time to illustrate how to handle these issues.
There are several blogs that can help you with this: