
Summary

In the blog Rebalance your mixed HDFS & Kudu Services, we demonstrated how to review and set up a mixed HDFS / Kudu shared-services cluster.

Now it is time to review a method for confirming how the data of your HDFS & Kudu services is distributed at the disk level on each and every worker node.

Investigation

Commands to check the balance of HDFS & Kudu

Log in as root to each worker node that is part of the HDFS and Kudu services, and run the following commands.

Check overall disk capacity status:

[root@<Worker-Node> ~]# df -h /data/* | sed 1d | sort
/dev/sdb        1.9T  1.4T  478G  75% /data/1
/dev/sdc        1.9T  1.3T  560G  70% /data/2
/dev/sdd        1.9T  1.4T  513G  73% /data/3
/dev/sde        1.9T  1.4T  489G  74% /data/4
/dev/sdf        1.9T  1.4T  464G  76% /data/5
/dev/sdg        1.9T  1.4T  513G  73% /data/6
/dev/sdh        1.9T  1.4T  525G  72% /data/7
/dev/sdi        1.9T  1.4T  466G  76% /data/8
/dev/sdj        1.9T  1.3T  538G  72% /data/9
/dev/sdk        1.9T  1.5T  418G  78% /data/10
/dev/sdl        1.9T  1.3T  617G  67% /data/11
/dev/sdm        1.9T  1.3T  572G  70% /data/12
/dev/sdn        1.9T  1.4T  474G  75% /data/13
/dev/sdo        1.9T  1.3T  534G  72% /data/14
/dev/sdp        1.9T  1.4T  468G  75% /data/15
/dev/sdq        1.9T  1.4T  470G  75% /data/16
/dev/sdr        1.9T  1.4T  466G  75% /data/17
/dev/sds        1.9T  1.4T  468G  75% /data/18
/dev/sdt        1.9T  1.4T  473G  75% /data/19
/dev/sdu        1.9T  1.4T  474G  75% /data/20
/dev/sdv        1.9T  1.4T  467G  75% /data/21
/dev/sdw        1.9T  1.4T  474G  75% /data/22
/dev/sdx        1.9T  1.4T  473G  75% /data/23
/dev/sdy        1.9T  1.4T  477G  75% /data/24

Check overall HDFS disk capacity status:

[root@<Worker-Node> ~]# du -h --max-depth=0 /data/*/dfs | sort -t/ -k3,3n
606G /data/1/dfs
612G /data/2/dfs
608G /data/3/dfs
609G /data/4/dfs
610G /data/5/dfs
619G /data/6/dfs
613G /data/7/dfs
634G /data/8/dfs
590G /data/9/dfs
681G /data/10/dfs
618G /data/11/dfs
621G /data/12/dfs
1.2T /data/13/dfs
1.1T /data/14/dfs
1.2T /data/15/dfs
1.2T /data/16/dfs
1.2T /data/17/dfs
1.2T /data/18/dfs
1.2T /data/19/dfs
1.2T /data/20/dfs
1.2T /data/21/dfs
1.2T /data/22/dfs
1.2T /data/23/dfs
1.2T /data/24/dfs

Check overall Kudu disk capacity status:

[root@<Worker-Node> ~]# du -h --max-depth=0 /data/*/kudu | sort -t/ -k3,3n
745G /data/1/kudu
691G /data/2/kudu
741G /data/3/kudu
765G /data/4/kudu
788G /data/5/kudu
730G /data/6/kudu
725G /data/7/kudu
763G /data/8/kudu
734G /data/9/kudu
768G /data/10/kudu
628G /data/11/kudu
669G /data/12/kudu
205G /data/13/kudu
204G /data/14/kudu
205G /data/15/kudu
208G /data/16/kudu
209G /data/17/kudu
205G /data/18/kudu
204G /data/19/kudu
204G /data/20/kudu
206G /data/21/kudu
203G /data/22/kudu
194G /data/23/kudu
200G /data/24/kudu

Now collate all of the information retrieved from the worker node into an easy-to-read format, so that out-of-sync characteristics at the worker-node layer are simple to spot.
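If you would rather gather all three views in one pass per node, a small shell loop does the job. This is a minimal sketch, assuming the /data/1 through /data/24 mount points used throughout this article, each holding a dfs and a kudu directory:

[root@<Worker-Node> ~]# for i in $(seq 1 24); do df -h /data/$i | sed 1d; du -sh /data/$i/dfs /data/$i/kudu; done

Each iteration prints the df line for one disk, followed by the HDFS and Kudu usage on that disk (du -sh is equivalent to du -h --max-depth=0), which maps directly onto the table format below.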

Worker Node Balance Example

Taking the output from the Commands to check the balance of HDFS & Kudu section, here is an example of how you might collate the information into a format that makes data balance issues at the worker-node level easy to notice.

 

Worker-Node

Disk        Size   Used   Avail  Use%    dfs (du -h)   kudu (du -h)
/data/1     1.9T   1.7T   164G   92%     414G          1.3T
/data/2     1.9T   1.5T   395G   79%     499G          970G
/data/3     1.9T   1.5T   351G   82%     487G          1022G
/data/4     1.9T   1.5T   338G   82%     493G          1.1T
/data/5     1.9T   1.5T   352G   82%     486G          1.1T
/data/6     1.9T   1.5T   337G   82%     498G          1.1T
/data/7     1.9T   1.5T   337G   82%     485G          1.1T
/data/8     1.9T   1.5T   350G   82%     494G          1018G
/data/9     1.9T   1.5T   339G   82%     475G          1.1T
/data/10    1.9T   1.5T   391G   80%     487G          985G
/data/11    1.9T   1.5T   338G   82%     487G          1.1T
/data/12    1.9T   1.6T   320G   83%     475G          1.1T
/data/13    1.9T   1.2T   688G   64%     1.2T          353M
/data/14    1.9T   1.2T   679G   64%     1.2T          8.5G
/data/15    1.9T   1.2T   674G   64%     1.2T          13G
/data/16    1.9T   1.2T   678G   64%     1.2T          8.0G
/data/17    1.9T   1.2T   686G   64%     1.2T          8.0K
/data/18    1.9T   1.2T   680G   64%     1.2T          5.4G
/data/19    1.9T   1.2T   694G   63%     1.2T          33M
/data/20    1.9T   1.2T   688G   64%     1.2T          8.0K
/data/21    1.9T   1.2T   689G   64%     1.2T          8.0K
/data/22    1.9T   1.2T   686G   64%     1.2T          129M
/data/23    1.9T   1.2T   679G   64%     1.2T          7.4G
/data/24    1.9T   1.2T   684G   64%     1.2T          33M

If you set up the mixed HDFS & Kudu configuration some time after the cluster was originally deployed, you are likely to encounter node-level disk capacity issues when running the Kudu rebalance command.

This is because the Kudu rebalancer is currently unaware of both total disk capacity and currently used disk capacity: it balances tablet replica counts across tablet servers, not bytes on disk.
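For reference, the service-level rebalance referred to above is run with the kudu CLI. This is a minimal sketch, where the master host names are placeholders and 7051 is the default master RPC port; adjust both for your environment:

[root@<Worker-Node> ~]# sudo -u kudu kudu cluster rebalance <master-1>:7051,<master-2>:7051,<master-3>:7051

This evens out tablet replica counts across tablet servers, but it does not consult per-disk capacity, which is why the node-level checks above remain necessary.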

Analyze the Data Distribution

Within the Worker Node Balance Example, we can see 24 disks whose figures highlight a fundamental imbalance between them.

All 24 disks in the example are configured for both HDFS and Kudu, but that configuration alignment happened only after the cluster had already been in use for many years.

Note how out of sync they are:

  • Disks 1-12 are far more utilized than Disks 13-24. This can happen when:
    • An extra 12 disks were added to the node at some point
    • The Kudu Tablet Server Role Group was configured after the node had already been deployed into HDFS / Kudu
  • Disk 1 is at 92%
    • If left unchecked, every service in the cluster that uses the data disks will be affected once this disk reaches 100%
    • Depending on how you monitor disk utilization per node, the node's overall capacity figure will not reveal that this single disk is nearly full; a quick check is sketched just after this list
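To catch a nearly full disk before it reaches 100%, a simple threshold check can be run on each worker node. This is a minimal sketch, assuming the same /data/* mounts as above and a 90% alert threshold:

[root@<Worker-Node> ~]# df -h /data/* | sed 1d | awk '$5+0 >= 90 {print $6" is at "$5}'

Against the example node in the table, this would flag /data/1 at 92% even while the node-wide average still looks healthy.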

There are other scenarios that can cause a similar imbalance, such as failed disks being replaced while the HDFS and Kudu rebalancing activities remain focused only at the service level.

Resolution

Whether it comes down to a later adoption or alignment of the HDFS or Kudu configuration that governs disk distribution, or simply a cluster that has had countless disks replaced over time with no local disk balancing applied afterward, it is time to illustrate how to handle these issues.

There are several blogs that can help you with this:

  • Replace your failed Worker Node disks
  • Rebalance your HDFS Disks (single node)
  • Rebalance your Kudu Disks (single node)
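As a taste of what the single-node HDFS post covers, the HDFS intra-DataNode disk balancer can generate, execute, and monitor a per-node plan. This is a minimal sketch, where the hostname and the generated plan path are placeholders, and dfs.disk.balancer.enabled may first need to be set to true:

[root@<Worker-Node> ~]# sudo -u hdfs hdfs diskbalancer -plan <worker-node-hostname>
[root@<Worker-Node> ~]# sudo -u hdfs hdfs diskbalancer -execute <path-to-generated-plan>.plan.json
[root@<Worker-Node> ~]# sudo -u hdfs hdfs diskbalancer -query <worker-node-hostname>

The Kudu side is handled differently and is covered in the Rebalance your Kudu Disks (single node) post above.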