Cloudera SMEs agree that managing and balancing both HDFS and Kudu storage services on the same cluster can be complex. If you are experiencing capacity issues across your worker node layer that stem from a wide range of hardware heterogeneity, or a Kudu rebalancer that breaches 100% capacity, here are some best practices to consider.
First, let’s recognize how the two storage services differ with respect to rebalancing.
HDFS is a mature service that considers the total capacity of each worker node when calculating that node’s utilization. It also prevents a worker node from being written to once it reaches 95% capacity. This means that, at a worker node level:
Kudu does not currently consider the total capacity of the worker node, nor does it ever prevent a worker node from being written to. This means that, at a worker node level:
The calculation for how much data is placed onto each Tablet Server during Kudu rebalancing is:

Total capacity used within Kudu / Number of Kudu Tablet Servers = Worker node capacity utilized by Kudu
Because of the differing rebalancing characteristics, managing both storage services in the same cluster can be complex.
If you have enough data within Kudu that the balancing logic needs to place 20TB of data on each Tablet Server, and that worker node only has 25TB of total available capacity, that worker node is already 80% full before HDFS even tries to use the same disks to store data.
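To make that arithmetic concrete, here is a minimal shell sketch of the placement formula. The Kudu totals are assumptions chosen to reproduce the 20TB-per-node scenario above; substitute your own figures.

```bash
kudu_used_tb=400        # total capacity used within Kudu (assumption)
tablet_servers=20       # number of Kudu Tablet Servers (assumption)
node_capacity_tb=25     # total capacity of the worker node, from the example

# Kudu rebalancing spreads the data evenly across Tablet Servers:
per_node_tb=$(( kudu_used_tb / tablet_servers ))
echo "Kudu data placed on each Tablet Server: ${per_node_tb} TB"

# Node utilization from Kudu alone, before HDFS writes anything:
awk -v u="$per_node_tb" -v c="$node_capacity_tb" \
    'BEGIN { printf "Utilization from Kudu alone: %.0f%%\n", u / c * 100 }'
```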
That calls for an assessment of the Disk Configuration strategy.
One of the first things to consider when you have HDFS & Kudu in the same cluster is whether to stripe the data from both services across all disks. Take two examples for a 24-disk worker node that is running both an HDFS DataNode and a Kudu Tablet Server.
Example 1. For example, assign:
Enforce the following characteristics:
Example 2. For example, assign:
Enforce the following characteristics:
Allow both HDFS & Kudu to utilize all disks as this allows:
After choosing how to manage the physical disks at the hardware level, let’s now determine how to resolve the extreme example highlighted in Rebalancing Characteristics - Summary.
When reviewing what is possible within the Kudu and HDFS services for sharing the same servers and disks, we find that HDFS offers a useful parameter: Reserved Space for Non DFS Use. Its description within Cloudera Manager reads: “Reserved space in bytes per volume for non-Distributed File System (DFS) use.”
"Non-DFS Used" is calculated by the following formula:
The following formula is also true:
So, we need to reform our formula to fully calculate “Non DFS Used:
The best way to illustrate this is to think of a single disk; the logic for a single disk then scales out to any configuration of disks at the worker node layer.
Assuming we have a 100GB disk, and:

- Reserved Space for Non DFS Use is set to 30GB
- the operating system and other non-DFS files consume 40GB
- DFS Used (data written by the DataNode itself) is 10GB
If you run df -h, you will see the available space is 50GB for that disk volume.
In the HDFS web UI, it will show:

Non DFS Used = 100GB (Total Disk Space) - 30GB (Reserved Space) - 10GB (DFS Used) - 50GB (Remaining) = 10GB
You initially configured 30GB to be reserved for non-DFS usage, theoretically leaving the remaining 70GB for HDFS/Kudu. However, actual non-DFS usage exceeds the theoretical 30GB reservation and consumes an additional 10GB of space that HDFS and Kudu both expect to have available.
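A quick shell sketch that reproduces this arithmetic:

```bash
total_gb=100        # Total Disk Space
reserved_gb=30      # Reserved Space for Non DFS Use
dfs_used_gb=10      # DFS Used
remaining_gb=50     # DFS Remaining (what df -h reports as available)

# The expanded formula from above:
non_dfs_used_gb=$(( total_gb - reserved_gb - dfs_used_gb - remaining_gb ))
echo "Non DFS Used reported by HDFS: ${non_dfs_used_gb} GB"    # prints 10 GB
```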
It is also important to note that Reserved Space for Non DFS Use is not a hard limit or quota, nor can it act as one. All the same, this parameter and the analysis we have performed confirm that it is a viable foundation for designing a mixed HDFS/Kudu cluster.
It’s now time to make some amendments to allow the HDFS and Kudu services to work together more harmoniously…
When using Kudu in a heterogeneous cluster, alongside HDFS, and sharing the disks between both services, “Reserved Space for Non DFS Use” is critical.
You must calculate the anticipated Kudu usage per disk under the ideal, fully rebalanced (100%) Kudu scenario, and then set the Reserved Space for Non DFS Use parameter to match.
The first step is to retrieve the current used capacity of the Kudu service. Example:
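One hedged way to approximate this from each Tablet Server’s shell, assuming the Kudu data directories live under /data/*/kudu (the paths are assumptions; substitute your own fs_data_dirs):

```bash
# Sum the on-disk size of this node's Kudu directories;
# repeat per Tablet Server and add the totals together:
du -sch /data/*/kudu 2>/dev/null | tail -1
```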
The calculation to identify the anticipated Kudu use per disk after a rebalance:

Total capacity used within Kudu / Number of Kudu Tablet Servers / Number of data disks per Tablet Server = anticipated Kudu use per disk
Some key points:
After confirming that the Role Groups are correctly configured, you can use the following calculation examples to determine the Non DFS used values you need for your own cluster:
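Here is a hedged sketch of that sizing calculation. Every input value is an assumption to be replaced with your own cluster’s figures, and the 20% growth buffer is a suggestion rather than a prescribed value:

```bash
kudu_used_tb=400      # current used capacity of the Kudu service
tablet_servers=20     # Tablet Servers in the Role Group
disks_per_node=24     # data disks shared by HDFS and Kudu on each node
headroom_pct=20       # growth buffer on top of the anticipated usage

awk -v k="$kudu_used_tb" -v ts="$tablet_servers" \
    -v d="$disks_per_node" -v h="$headroom_pct" 'BEGIN {
  per_disk = k / ts / d
  reserved = per_disk * (1 + h / 100)
  printf "Anticipated Kudu use per disk after rebalance: %.3f TB\n", per_disk
  printf "Reserved Space for Non DFS Use per volume:     %.3f TB\n", reserved
  printf "Value in bytes for Cloudera Manager:           %.0f\n", reserved * 1024 ^ 4
}'
```

Setting the per-volume reservation at or slightly above the anticipated Kudu use per disk keeps HDFS from claiming the space that Kudu will need once fully rebalanced.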
To highlight the characteristics you would see if the capacity assigned to Kudu (via the HDFS Reserved Space for Non DFS Use configuration) were less than the capacity actually utilized, refer to the following screenshot example:
The illustration shows that 52.9TB of Non DFS use is present in the cluster beyond the configured settings. This is a great indicator that the Non DFS Used configuration needs further tuning: it is an early sign that the data in your Kudu service is growing beyond the original tuning or design.
Once you have performed the activities within this blog to reconsider how your HDFS and Kudu services are configured, you will then need to rebalance both HDFS and Kudu.
Go to CM - Kudu - Actions - Run Kudu Rebalancer Tool:
Go to CM - HDFS - Actions - Rebalance
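If you prefer the command line to the CM UI, hedged equivalents exist; the host names below are placeholders, and you should verify the flags against your Kudu and HDFS versions:

```bash
# Kudu: run the rebalancer against the masters (kudu CLI 1.9+):
kudu cluster rebalance master1.example.com:7051,master2.example.com:7051,master3.example.com:7051

# HDFS: run the balancer as the hdfs user with a 10% utilization threshold:
sudo -u hdfs hdfs balancer -threshold 10
```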
Some key notes about performing the rebalancing activities after setting the services/disks up: