New Contributor
Posts: 1
Registered: ‎07-02-2018

Can Kudu replace HBase for key-based queries at high rate?

Hi,

 

We are designing a detection system with two main parts:
1. Key-based queries:   

  - Get the last 20 activities for a specified key

  - We expect several thousand queries per second, but want something that can scale well beyond that if required for large clients.

  - Could be HBase or Kudu

 

2. Ad-hoc queries:

  - Ad-hoc analytics

  - Should serve about 20 concurrent users (say, up to 100 for large clients)

  - Could be HDFS Parquet or Kudu

 

We would like to use a single storage layer for both, and Kudu seems like a great fit, provided it can handle queries at a high rate.

  • Is Kudu a good fit for this kind of system, which would usually use a NoSQL engine such as HBase or Cassandra?

  • What is Kudu's limit in terms of queries per second? (Of course this depends on cluster specs, partitioning, etc., and we can take that into account, but a rough estimate of scalability would help.)

  • A link to something official or a recent benchmark would also be appreciated.

Thanks

Cloudera Employee
Posts: 64
Registered: ‎09-28-2015

Re: Can Kudu replace HBase for key-based queries at high rate?

Hi,

 

Kudu can certainly scale to tens of thousands of point queries per second, similar to other NoSQL systems. For example, in preparing the slides posted on https://kudu.apache.org/2017/10/23/nosql-kudu-spanner-slides.html I ran a random-read benchmark using five 16-core GCE machines and got 12k reads/second. Since then we've made significant improvements in random-read performance, and I expect you'd get much better than that if you were to re-run the benchmark on the latest versions. In a more recent benchmark on a 6-node physical cluster I was able to achieve over 100k reads/second.

 

Keep in mind that such numbers are only achievable through direct use of the Kudu API (i.e., Java, C++, or Python) and not via SQL queries through an engine like Impala or Spark. Typically those engines are better suited to longer (>100ms) analytic queries than to high-concurrency point lookups.

 

-Todd
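
To illustrate the access pattern from the question ("get the last 20 activities for a specified key"), here is a minimal pure-Python sketch. It is not Kudu client code and is not taken from the reply above. It assumes a hypothetical table whose primary key is (entity_key, event_time), and it models the fact that Kudu keeps rows ordered by primary key within a tablet, so such a lookup becomes a short range scan over one key prefix rather than a full scan. The names ActivityStore, insert and last_n are made up for this example.

```python
import bisect


class ActivityStore:
    """In-memory stand-in for a table keyed by (entity_key, event_time).

    Hypothetical model only: rows are kept sorted by primary key, which is
    the property a real range scan over a key prefix would rely on.
    """

    def __init__(self):
        # Sorted list of (entity_key, event_time, payload) tuples.
        self._rows = []

    def insert(self, entity_key, event_time, payload):
        # Assumes (entity_key, event_time) is unique, like a primary key.
        bisect.insort(self._rows, (entity_key, event_time, payload))

    def last_n(self, entity_key, n=20):
        # Locate the contiguous slice of rows that belong to entity_key.
        lo = bisect.bisect_left(self._rows, (entity_key, float("-inf")))
        hi = bisect.bisect_left(self._rows, (entity_key, float("inf")))
        # Keep only the n most recent rows, newest first.
        start = max(lo, hi - n)
        return list(reversed(self._rows[start:hi]))


store = ActivityStore()
for t in range(50):
    store.insert("user-a", t, f"a{t}")
    store.insert("user-b", t, f"b{t}")

latest = store.last_n("user-a", 20)
print(len(latest), latest[0], latest[-1])
```

With a real cluster, the equivalent lookup would go through the Kudu client API directly, as the reply recommends, with an equality predicate on the key column; the exact schema and partitioning are design decisions this sketch does not cover.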
