The Kudu service has 3 Kudu masters and 5 Kudu tablet servers.
Of these 5 tablet servers, 3 have high memory usage: each tablet server has a 60GB memory limit, and those 3 are using more than 85% to 90% of it, which ends up causing job failures.
The memory usage behavior is shown below:
TABLET SERVER: hostname04.local - USE MEMORY: 55.28G / %: 92.13%
TABLET SERVER: hostname05.local - USE MEMORY: 54.22G / %: 90.38%
TABLET SERVER: hostname06.local - USE MEMORY: 29.05G / %: 48.42%
TABLET SERVER: hostname07.local - USE MEMORY: 28.47G / %: 47.46%
TABLET SERVER: hostname08.local - USE MEMORY: 54.99G / %: 91.66%
memory usage graph of all Kudu tablet servers in Cloudera Manager:
We would like to understand why Kudu behaves this way and how we can reduce this memory usage.
PS: the Kudu service is installed and managed by Cloudera Manager Express version 5.16.2, and Kudu is version 1.7.
There are several possible reasons for this, including but not limited to: too low a value for memory_limit_hard_bytes; too-large tablets; too heavy a workload (which basically amounts to insufficient memory once there is no more room to increase the limit).
To tell more, we need to know more about the system. Can you tell me the following things about your cluster:
1) How many tablet replicas are there in your cluster, total?
2) How much data is in the kudu cluster? This should be available in the Charts section of Cloudera Manager.
3) How many fs_data_dirs are there for your Tablet Servers?
4) How many maintenance manager threads are there?
5) What is the current value of memory_limit_hard_bytes?
With this information I can identify whether you are exceeding any basic scaling limits, or have an inefficient arrangement of workload, or have some kind of basic bottleneck that slows things down.
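A quick way to read a daemon's effective flag values is its embedded web UI: every Kudu daemon serves them at /varz (tablet server default port 8050). The sketch below filters a captured sample of that output rather than hitting a live endpoint, so it runs anywhere; the fs_data_dirs path in the sample is a hypothetical illustration, while the other values match the ones reported later in this thread.

```shell
# Each Kudu daemon exposes its effective gflags at /varz on its web UI,
# e.g. (hostname from this thread, default tablet server web port 8050):
#   curl -s http://hostname04.local:8050/varz
# Below we filter a captured sample of that output instead of a live
# endpoint (the fs_data_dirs path is hypothetical):
varz_sample='--block_cache_capacity_mb=15360
--fs_data_dirs=/data/1/kudu,/data/2/kudu
--maintenance_manager_num_threads=3
--memory_limit_hard_bytes=64424509440'
printf '%s\n' "$varz_sample" \
  | grep -E 'memory_limit_hard_bytes|block_cache_capacity_mb|maintenance_manager_num_threads|fs_data_dirs'
```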
1) - answer:
Masters | 3
Tablet Servers | 5
Tables | 48
Tablets | 481
Replicas | 1443
2) - answer:
According to the chart "Total Tablet Size On Disk Across Kudu Replicas", it is more than 710GB
3) - answer:
4) - answer:
there are 3 threads
5) - answer:
memory_limit_hard_bytes = 60GB
block_cache_capacity_mb = 15GB
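For reference, these are tablet server gflags; in a Cloudera Manager deployment they are typically set through the tablet server's gflagfile safety valve (the exact field name may vary by CM version). A sketch of the corresponding gflagfile lines, using the values above:

```
# 60 GiB hard memory limit (60 * 1024^3 bytes)
--memory_limit_hard_bytes=64424509440
# 15 GiB block cache, expressed in MiB
--block_cache_capacity_mb=15360
--maintenance_manager_num_threads=3
```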
These are all reasonable numbers by themselves, so no huge red flags here. My next question is: are the tablets balanced across the Tablet Servers? If you run a kudu cluster ksck, how many replicas does it show on each server?
One scenario where we might see something like this is if the cluster started with 3 Tablet Servers and 2 more were added later. This causes an imbalance because only data written after those nodes joined is placed on them, which means fewer tablets running there and therefore less memory load.
The kudu rebalancer may be an option:
$ sudo -u kudu kudu cluster rebalance <master addresses>
Yes, the tablet servers are balanced: today there are 1443 replicas, and dividing by 5 tablet servers gives roughly 289 replicas per tablet server, as shown for each tablet server below:
hostname04.local - replicas by tablets - RUNNING-289
hostname05.local - replicas by tablets - RUNNING-288
hostname06.local - replicas by tablets - RUNNING-289
hostname07.local - replicas by tablets - RUNNING-289
hostname08.local - replicas by tablets - RUNNING-288
There have been times when host 06 was also above 85% to 90% memory usage.
When we restart a tablet server that has high memory usage, sometimes that same tablet server climbs back up, or another tablet server that had low memory usage starts to rise and also goes above 85%.
@yagoaparecidoti If you notice that the high memory utilization moves from tablet server to tablet server, then another candidate is a problem with the schema of one or more tables. The specific symptom I am thinking of is that the tablet size may be too high due to too few partitions; this can drive high memory utilization because of the amount of information that must be loaded into memory.
The fast way to determine whether this is the case is to look at the Charts for the Tablet Server with high memory utilization and check the total size of data across all tablets on that server. Then divide this number by the number of replicas on the server; this gives an average replica size. It is perfectly possible for this problem to originate with a single table, so if you have one or more tables you know have few partitions, I recommend checking those tables specifically.
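As a worked example of that division, using the cluster-wide figures from earlier in this thread (~710 GB on disk across 1443 replicas); for the per-server check, substitute one server's chart total and its replica count instead:

```shell
# Average on-disk size per replica, from this thread's cluster-wide figures.
# total_gb (~710) and replicas (1443) come from the answers above; swap in
# one tablet server's totals to do the per-server check described here.
total_gb=710
replicas=1443
awk -v t="$total_gb" -v r="$replicas" \
  'BEGIN { printf "%.2f GB per replica\n", t / r }'
# prints "0.49 GB per replica"
```

If one server's average comes out much higher than this cluster-wide figure, a few large, under-partitioned tablets are likely concentrated on that server.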
Going beyond this will require log analysis and is better suited to a support case.