
Summary

In the blog Rebalance your mixed HDFS & Kudu Services, we demonstrated how to review and set up a mixed HDFS / Kudu shared-services cluster.

Now it is time to review a method for confirming how the data of your HDFS & Kudu services is distributed at the disk level on each and every worker node.

Investigation

Commands to check the balance of HDFS & Kudu

Log in as root to each worker node that is part of the HDFS and Kudu services, and run the following commands.

Check overall disk capacity status:

[root@<Worker-Node> ~]# df -h /data/* | sed 1d | sort
/dev/sdb        1.9T  1.4T  478G  75% /data/1
/dev/sdc        1.9T  1.3T  560G  70% /data/2
/dev/sdd        1.9T  1.4T  513G  73% /data/3
/dev/sde        1.9T  1.4T  489G  74% /data/4
/dev/sdf        1.9T  1.4T  464G  76% /data/5
/dev/sdg        1.9T  1.4T  513G  73% /data/6
/dev/sdh        1.9T  1.4T  525G  72% /data/7
/dev/sdi        1.9T  1.4T  466G  76% /data/8
/dev/sdj        1.9T  1.3T  538G  72% /data/9
/dev/sdk        1.9T  1.5T  418G  78% /data/10
/dev/sdl        1.9T  1.3T  617G  67% /data/11
/dev/sdm        1.9T  1.3T  572G  70% /data/12
/dev/sdn        1.9T  1.4T  474G  75% /data/13
/dev/sdo        1.9T  1.3T  534G  72% /data/14
/dev/sdp        1.9T  1.4T  468G  75% /data/15
/dev/sdq        1.9T  1.4T  470G  75% /data/16
/dev/sdr        1.9T  1.4T  466G  75% /data/17
/dev/sds        1.9T  1.4T  468G  75% /data/18
/dev/sdt        1.9T  1.4T  473G  75% /data/19
/dev/sdu        1.9T  1.4T  474G  75% /data/20
/dev/sdv        1.9T  1.4T  467G  75% /data/21
/dev/sdw        1.9T  1.4T  474G  75% /data/22
/dev/sdx        1.9T  1.4T  473G  75% /data/23
/dev/sdy        1.9T  1.4T  477G  75% /data/24

Check overall HDFS disk capacity status:

[root@<Worker-Node> ~]# du -h --max-depth=0 /data/*/dfs | sort -t/ -k3,3n
606G /data/1/dfs
612G /data/2/dfs
608G /data/3/dfs
609G /data/4/dfs
610G /data/5/dfs
619G /data/6/dfs
613G /data/7/dfs
634G /data/8/dfs
590G /data/9/dfs
681G /data/10/dfs
618G /data/11/dfs
621G /data/12/dfs
1.2T /data/13/dfs
1.1T /data/14/dfs
1.2T /data/15/dfs
1.2T /data/16/dfs
1.2T /data/17/dfs
1.2T /data/18/dfs
1.2T /data/19/dfs
1.2T /data/20/dfs
1.2T /data/21/dfs
1.2T /data/22/dfs
1.2T /data/23/dfs
1.2T /data/24/dfs

Check overall Kudu disk capacity status:

[root@<Worker-Node> ~]# du -h --max-depth=0 /data/*/kudu | sort -t/ -k3,3n
745G /data/1/kudu
691G /data/2/kudu
741G /data/3/kudu
765G /data/4/kudu
788G /data/5/kudu
730G /data/6/kudu
725G /data/7/kudu
763G /data/8/kudu
734G /data/9/kudu
768G /data/10/kudu
628G /data/11/kudu
669G /data/12/kudu
205G /data/13/kudu
204G /data/14/kudu
205G /data/15/kudu
208G /data/16/kudu
209G /data/17/kudu
205G /data/18/kudu
204G /data/19/kudu
204G /data/20/kudu
206G /data/21/kudu
203G /data/22/kudu
194G /data/23/kudu
200G /data/24/kudu

Now collate all of the information retrieved from the worker node into an easy-to-read format, so that out-of-sync characteristics at the worker-node layer are simple to spot.
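If you would rather gather all three views in one pass per node, a small shell loop does the job. This is a minimal sketch, assuming the /data/1 through /data/24 mount points used throughout this article, each holding a dfs and a kudu directory:

[root@<Worker-Node> ~]# for i in $(seq 1 24); do df -h /data/$i | sed 1d; du -sh /data/$i/dfs /data/$i/kudu; done

Each iteration prints the df line for one disk, followed by the HDFS and Kudu usage on that disk (du -sh is equivalent to du -h --max-depth=0), which maps directly onto the table format below.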

Worker Node Balance Example

Taking the output from the Commands to check the balance of HDFS & Kudu section, here is an example of how you might collate the information into a format that makes data balance issues at the worker-node level easy to notice.

 

Worker-Node

Disk        Size   Used   Avail  Use%    dfs (du -h)   kudu (du -h)
/data/1     1.9T   1.7T   164G   92%     414G          1.3T
/data/2     1.9T   1.5T   395G   79%     499G          970G
/data/3     1.9T   1.5T   351G   82%     487G          1022G
/data/4     1.9T   1.5T   338G   82%     493G          1.1T
/data/5     1.9T   1.5T   352G   82%     486G          1.1T
/data/6     1.9T   1.5T   337G   82%     498G          1.1T
/data/7     1.9T   1.5T   337G   82%     485G          1.1T
/data/8     1.9T   1.5T   350G   82%     494G          1018G
/data/9     1.9T   1.5T   339G   82%     475G          1.1T
/data/10    1.9T   1.5T   391G   80%     487G          985G
/data/11    1.9T   1.5T   338G   82%     487G          1.1T
/data/12    1.9T   1.6T   320G   83%     475G          1.1T
/data/13    1.9T   1.2T   688G   64%     1.2T          353M
/data/14    1.9T   1.2T   679G   64%     1.2T          8.5G
/data/15    1.9T   1.2T   674G   64%     1.2T          13G
/data/16    1.9T   1.2T   678G   64%     1.2T          8.0G
/data/17    1.9T   1.2T   686G   64%     1.2T          8.0K
/data/18    1.9T   1.2T   680G   64%     1.2T          5.4G
/data/19    1.9T   1.2T   694G   63%     1.2T          33M
/data/20    1.9T   1.2T   688G   64%     1.2T          8.0K
/data/21    1.9T   1.2T   689G   64%     1.2T          8.0K
/data/22    1.9T   1.2T   686G   64%     1.2T          129M
/data/23    1.9T   1.2T   679G   64%     1.2T          7.4G
/data/24    1.9T   1.2T   684G   64%     1.2T          33M

If you set up the mixed HDFS & Kudu configuration some time after the cluster was originally deployed, you are likely to encounter node-level disk capacity issues when running the Kudu rebalance command.

This is because the Kudu rebalancer is currently unaware of both total disk capacity and currently used disk capacity: it balances tablet replica counts across tablet servers, not bytes on disk.
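For reference, the service-level rebalance referred to above is run with the kudu CLI. This is a minimal sketch, where the master host names are placeholders and 7051 is the default master RPC port; adjust both for your environment:

[root@<Worker-Node> ~]# sudo -u kudu kudu cluster rebalance <master-1>:7051,<master-2>:7051,<master-3>:7051

This evens out tablet replica counts across tablet servers, but it does not consult per-disk capacity, which is why the node-level checks above remain necessary.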

Analyze the Data Distribution

Within the Worker Node Balance Example, we can see 24 disks whose figures highlight a fundamental imbalance between them.

All 24 disks in the example are configured for both HDFS and Kudu, but that configuration alignment happened only after the cluster had already been in use for many years.

Note how out of sync they are:

  • Disks 1-12 are far more utilized than Disks 13-24. This can happen when:
    • An extra 12 disks were added to the node at some point
    • The Kudu Tablet Server Role Group was configured after the node had already been deployed into HDFS / Kudu
  • Disk 1 is at 92%
    • If left unchecked, every service in the cluster that uses the data disks will be affected once this disk reaches 100%
    • Depending on how you monitor disk utilization per node, the node's overall capacity figure will not reveal that this single disk is nearly full; a quick check is sketched just after this list
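To catch a nearly full disk before it reaches 100%, a simple threshold check can be run on each worker node. This is a minimal sketch, assuming the same /data/* mounts as above and a 90% alert threshold:

[root@<Worker-Node> ~]# df -h /data/* | sed 1d | awk '$5+0 >= 90 {print $6" is at "$5}'

Against the example node in the table, this would flag /data/1 at 92% even while the node-wide average still looks healthy.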

There are other scenarios that can cause a similar imbalance, such as failed disks being replaced while the HDFS and Kudu rebalancing activities remain focused only at the service level.

Resolution

Whether it comes down to a later adoption or alignment of the HDFS or Kudu configuration that governs disk distribution, or simply a cluster that has had countless disks replaced over time with no local disk balancing applied afterward, it is time to illustrate how to handle these issues.

There are several blogs that can help you with this:

  • Replace your failed Worker Node disks
  • Rebalance your HDFS Disks (single node)
  • Rebalance your Kudu Disks (single node)
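As a taste of what the single-node HDFS post covers, the HDFS intra-DataNode disk balancer can generate, execute, and monitor a per-node plan. This is a minimal sketch, where the hostname and the generated plan path are placeholders, and dfs.disk.balancer.enabled may first need to be set to true:

[root@<Worker-Node> ~]# sudo -u hdfs hdfs diskbalancer -plan <worker-node-hostname>
[root@<Worker-Node> ~]# sudo -u hdfs hdfs diskbalancer -execute <path-to-generated-plan>.plan.json
[root@<Worker-Node> ~]# sudo -u hdfs hdfs diskbalancer -query <worker-node-hostname>

The Kudu side is handled differently and is covered in the Rebalance your Kudu Disks (single node) post above.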