Kudu Tablet Server High CPU


New Contributor

I have a cluster with three tservers. When I run a read-heavy workload, one of the three tservers reaches nearly 100% CPU utilization while the other two stay below 10%. The tablets are balanced equally among the three.

My guess is that, by chance, all the data used in this particular workload happens to reside on that one tserver.

Thoughts? How might I diagnose this further?


Re: Kudu Tablet Server High CPU

Contributor

Before we get into metrics and other lower-level troubleshooting techniques, let's start with how you're reading.

 

What are you using to read? If it's the raw Kudu API, are you using the LEADER_ONLY replica selection policy? If so, and if your three-node cluster is heavily skewed so that most leader replicas are on one node, that node may be servicing the majority of your scans.
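To make the effect concrete, here is a minimal sketch of how leader skew turns into load skew when every scan goes to a tablet's leader. It is a toy model, not Kudu code: the class name, the tablet layout (20 tablets, 18 led by tserver 0), and the scan counts are all made up for illustration.

```java
import java.util.Arrays;

class LeaderOnlyLoad {
    /**
     * Returns the number of scans each tserver serves when every scan is
     * sent to the tablet's leader replica.
     * leaderOf[t] is the index of the tserver leading tablet t (hypothetical layout).
     */
    static int[] scansPerTserver(int[] leaderOf, int tserverCount, int scansPerTablet) {
        int[] load = new int[tserverCount];
        for (int leader : leaderOf) {
            load[leader] += scansPerTablet;
        }
        return load;
    }

    public static void main(String[] args) {
        // Assumed layout: 20 tablets on 3 tservers. Replicas may be perfectly
        // balanced, yet 18 of the 20 leaders sit on tserver 0.
        int[] leaderOf = new int[20]; // defaults to 0, i.e. led by tserver 0
        leaderOf[18] = 1;
        leaderOf[19] = 2;

        int[] load = scansPerTserver(leaderOf, 3, 100);
        System.out.println(Arrays.toString(load)); // prints [1800, 100, 100]
    }
}
```

Even though each tserver would hold the same number of replicas in this model, tserver 0 ends up with 90% of the scans, which is roughly the pattern described in the question.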

 

Re: Kudu Tablet Server High CPU

New Contributor
Yes, we are using the API via the Kudu Java client libraries. I don't see the code passing a replica selection policy when we create a scanner object, so it uses the default, which I believe must be LEADER_ONLY.

How can I find out how the leader replicas are balanced among the nodes?

Re: Kudu Tablet Server High CPU

Contributor
I don't believe there's a way to do that as yet. You can run the manual rebalancer in report-only mode ('kudu cluster rebalance --report_only') and see what it says.

If you don't need the stronger consistency guarantees of LEADER_ONLY, change your replica selection policy to CLOSEST_REPLICA. That should distribute reads more evenly, provided your scan requests originate evenly across the cluster's nodes.

Re: Kudu Tablet Server High CPU

New Contributor

Will the rebalancer distribute the leaders evenly across the cluster? That is not clear from the docs. It seems to balance only the replicas; should that result in the leaders being balanced as well?

Let's say I only have one host reading from the cluster and I select CLOSEST_REPLICA. Won't I end up in the same situation? How does the master distribute load? By IP address hash? Can I change this to round robin (RR), or is this something controlled from the client side?

For others reading this post: I was able to identify the leader tablet distribution without using the rebalance tool, since I am on an older Kudu version that does not provide it.

I copied the live tablet info from the web UI of each tserver into Excel and found that 95% of the leaders are on the first tserver.
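For anyone who would rather script the tally than use a spreadsheet, a rough sketch follows. It assumes the pasted tablet info has been saved as comma-separated rows of the form `tserver,tablet_id,role`; that layout, the class name, and the sample rows are hypothetical, so adjust the column positions to whatever your copy of the web UI table actually contains.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LeaderTally {
    /**
     * Counts leader replicas per tserver.
     * Each row is assumed to look like "tserver,tablet_id,role" (hypothetical format).
     */
    static Map<String, Integer> leadersPerTserver(List<String> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            String[] cols = row.trim().split("\\s*,\\s*");
            if (cols.length < 3) {
                continue; // skip blank or malformed rows
            }
            String tserver = cols[0];
            counts.putIfAbsent(tserver, 0); // list tservers that lead nothing, too
            if (cols[2].equalsIgnoreCase("LEADER")) {
                counts.merge(tserver, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample: three tablets, each replicated on three tservers.
        List<String> rows = List.of(
            "ts1,t-001,LEADER", "ts2,t-001,FOLLOWER", "ts3,t-001,FOLLOWER",
            "ts1,t-002,LEADER", "ts2,t-002,FOLLOWER", "ts3,t-002,FOLLOWER",
            "ts1,t-003,FOLLOWER", "ts2,t-003,LEADER", "ts3,t-003,FOLLOWER");

        System.out.println(leadersPerTserver(rows)); // prints {ts1=2, ts2=1, ts3=0}
    }
}
```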

Thanks

Re: Kudu Tablet Server High CPU

Contributor
No, the rebalancer doesn't fix leader skew, though it may in a future release. Leaders can cluster onto one tserver when individual tservers are restarted; if you restart the entire cluster all at once, you might be able to redistribute leadership more evenly.

You're right that if only one host initiates reads, the reads will go to the local tserver rather than round-robin across the cluster. The master doesn't directly tell clients where to scan; it just gives them enough information to make that decision based on their replica selection policy. There's also no way to do round-robin (or randomized) replica selection.
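As a rough illustration of the single-host case, the sketch below models a client that always scans the replica on its own host. This is a deliberately simplified model, not the Kudu client's actual selection logic: it assumes every client runs on a host that also runs a tserver and that every tserver holds a replica of every tablet (for example, three tservers with replication factor 3). The class name and scan counts are invented.

```java
import java.util.Arrays;

class ClosestReplicaLoad {
    /**
     * Simplified model: each scan is served by the tserver on the client's own host.
     * clientHostOf[i] is the index of the tserver co-located with the client
     * that issues scan i (hypothetical layout).
     */
    static int[] scansPerTserver(int[] clientHostOf, int tserverCount) {
        int[] load = new int[tserverCount];
        for (int host : clientHostOf) {
            load[host]++;
        }
        return load;
    }

    public static void main(String[] args) {
        int scans = 300;

        // Case 1: every scan originates on host 0.
        int[] singleHost = new int[scans];

        // Case 2: scans originate evenly from all three hosts.
        int[] spread = new int[scans];
        for (int i = 0; i < scans; i++) {
            spread[i] = i % 3;
        }

        System.out.println(Arrays.toString(scansPerTserver(singleHost, 3))); // prints [300, 0, 0]
        System.out.println(Arrays.toString(scansPerTserver(spread, 3)));     // prints [100, 100, 100]
    }
}
```

Under these assumptions, the load only evens out when the scan requests themselves originate from several hosts.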