Created 03-28-2022 12:56 PM
We would like to understand two behaviors that Kudu is showing.
1. After a Kudu tablet server is restarted, it takes between 5 and 10 minutes before it communicates with the Kudu master again. Why is this happening?
2. There are 5 Kudu tablet servers, and two of them are using more than 90% of the memory set in "memory_limit_hard_bytes". Why is this happening?
Created 03-28-2022 02:01 PM
With regards to your questions:
1. To understand this behavior, we would need to check the tablet server logs from startup. Block processing likely accounts for a good part of those 10 minutes.
2. Those 2 servers may be overloaded, which can happen if the Kudu cluster is not balanced.
To dig further, the output of ksck and of the following command would give a better picture of your cluster:
kudu table list ${MASTER_ADDRS} -list_tablets | grep "^ " | cut -d' ' -f6,7 | sort
If you can attach both outputs here, that would be great for further analysis.
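For anyone who wants to tally the result rather than read it by eye, here is a minimal Python sketch. It assumes each non-blank line produced by the pipeline above has the form "<uuid> <host:port>" (one line per replica); the function name is hypothetical and not part of any Kudu tooling.

```python
from collections import Counter


def count_replicas(lines):
    """Count replicas per tablet server.

    Assumption: each non-blank line looks like '<uuid> <host:port>',
    one line per replica, as printed by the cut/sort pipeline.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Blank or short lines (e.g. from tablet header rows) are skipped.
            continue
        uuid, address = fields[0], fields[1]
        counts[(uuid, address)] += 1
    return counts
```

The saved output of the command can be passed in as a list of lines, for example `count_replicas(open("replicas.txt"))`, where `replicas.txt` is a hypothetical file name.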
Created 03-29-2022 08:54 AM
Hi @jromero
Kudu is balanced correctly.
There are 5 Kudu tablet servers and a total of 1173 replicas, and each tablet server holds about 234 of them, which shows that the cluster is balanced.
Even though it is balanced, it still shows the two problems mentioned.
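As a quick check of the arithmetic: 1173 replicas over 5 tablet servers is 234.6 per server, so a perfectly even layout would be three servers with 235 replicas and two with 234. A minimal Python sketch of that calculation follows; the function names are hypothetical, and the per-server list is computed, not measured on this cluster.

```python
def even_split(total, servers):
    """Return the most even per-server counts for `total` replicas."""
    base, extra = divmod(total, servers)
    # `extra` servers carry one replica more than the rest.
    return [base + 1] * extra + [base] * (servers - extra)


def replica_skew(counts):
    """Return (min, max, max - min) for a list of per-server counts."""
    lo, hi = min(counts), max(counts)
    return lo, hi, hi - lo
```

Here `even_split(1173, 5)` gives the ideal layout, and `replica_skew` can be applied to the real per-server counts to see how far they are from it. This only compares replica counts; it says nothing about replica size or load.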
Created 03-29-2022 01:53 PM
Hello @yagoaparecidoti
Being balanced is a good thing. How about hotspotting?
Can you check in the Kudu web UI > Tablet Servers > click on the affected ones > check the metrics and RPC pages for those two?
> How does it look?
> How many RPC calls compared to a healthy tablet server?
Created 03-29-2022 02:03 PM
Hi @jromero
The "metrics" [1] and "rpcz" [2] web pages are very long.
What should we actually look at?
Which parameters do we need to check?
[1] - http://ip_host:8050/metrics
[2] - http://ip_host:8050/rpcz
Created 03-29-2022 02:28 PM
Inbound/Outbound connections on the rpc page.
rpc_* metrics on the metrics page. Sorry I don't recall the exact metric names.
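Since the metrics page is long, one way to narrow it down is to filter it by name prefix. The sketch below is only an illustration: it assumes the /metrics page returns JSON as a list of entities, each with a "type" and a "metrics" list of objects carrying "name" and usually "value". The function names are hypothetical.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Download and parse a JSON page from the tablet server web UI."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def select_metrics(entities, prefix="rpc"):
    """Keep server-level metrics whose name starts with `prefix`.

    Assumption: `entities` is a list of dicts with 'type' and 'metrics'.
    """
    selected = {}
    for entity in entities:
        if entity.get("type") != "server":
            continue  # skip per-tablet entities
        for metric in entity.get("metrics", []):
            name = metric.get("name", "")
            if name.startswith(prefix):
                # Metrics without a scalar 'value' are kept as the whole dict.
                selected[name] = metric.get("value", metric)
    return selected
```

For example, `select_metrics(fetch_json("http://ip_host:8050/metrics"))` could be run against an affected and a healthy tablet server, and the two results compared side by side.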
Created 03-29-2022 04:53 PM
Hi @jromero
On the page "http://ip_host:8050/rpcz":
"inbound_connections" shows the addresses of other tablet servers in the "open" state.
"outbound_connections" shows the addresses of other tablet servers and the master in the "open" state.
The same is true on the other tablet servers.
I didn't find anything that could be related to the problems mentioned.
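To compare the rpcz page between servers without reading it line by line, the connection lists can be reduced to a few numbers. This is a rough Python sketch that assumes the page has already been parsed from JSON into a dict with "inbound_connections" and "outbound_connections" lists; the per-connection fields "state" and "calls_in_flight" are assumptions about the page layout, and the function name is hypothetical.

```python
from collections import Counter


def summarize_rpcz(rpcz):
    """Summarize connection counts, states, and in-flight calls.

    Assumption: each connection is a dict that may contain 'state'
    and a 'calls_in_flight' list.
    """
    summary = {}
    for direction in ("inbound_connections", "outbound_connections"):
        conns = rpcz.get(direction, [])
        states = Counter(str(c.get("state", "unknown")).lower() for c in conns)
        in_flight = sum(len(c.get("calls_in_flight", [])) for c in conns)
        summary[direction] = {
            "total": len(conns),
            "states": dict(states),
            "calls_in_flight": in_flight,
        }
    return summary
```

Running this on the parsed output of an affected and a healthy tablet server gives two small summaries to compare, for instance a much higher number of in-flight calls on one of them.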
Created 03-30-2022 12:27 PM
Thanks. In that case further investigation is needed: we would have to check what is happening on those 2 tablet servers. If you can share the logs from those tablet servers, that would be great. If not, it will be quite hard to tell, and your best bet is to open a support case to have it checked.
> Are you able to check the charts in Cloudera Manager > Kudu > Instances > Tablet Server > Chart Library > Replicas? Can you compare them with a non-affected tablet server?
Created 04-04-2022 08:40 AM
Hi @jromero
Thanks for the feedback.
Unfortunately we cannot share the TS logs, as they contain sensitive information.
We will try to open a ticket with support.
Checking the TS replica chart "Total Tablet Size On Disk Across Kudu Replicas", all tablet servers show the same size.
Created 04-03-2022 11:32 PM
@yagoaparecidoti, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. If you are still experiencing the issue, can you provide the information @jromero has requested?
Regards,
Vidya Sargur,