We are designing a detection system, in which we have two main parts:
1. Key-based queries:
- Get the last 20 activities for a specified key
- We expect several thousand queries per second, but want something that can scale well beyond that if required for large clients.
- Could be HBase or Kudu
2. Ad-hoc queries:
- Ad-hoc analytics
- Should serve about 20 concurrent users (up to 100, say, for large clients).
- Could be HDFS Parquet or Kudu
We wanted to use a single storage system for both, and Kudu seems great, if it can handle queries at a high rate.
Kudu can certainly scale to tens of thousands of point queries per second, similar to other NoSQL systems. For example, in preparing the slides posted on https://kudu.apache.org/2017/10/23/nosql-kudu-spanner-slides.html I ran a random-read benchmark using five 16-core GCE machines and got 12k reads/second. Since then we've made significant improvements in random read performance, and I expect you'd get much better than that if you were to re-run the benchmark on the latest versions. In a more recent benchmark on a 6-node physical cluster I was able to achieve over 100k reads/second.
Keep in mind that such numbers are only achievable through direct use of the Kudu API (i.e., Java, C++, or Python) and not via SQL queries through an engine like Impala or Spark. Typically those engines are better suited to longer (>100ms) analytic queries than to high-concurrency point lookups.
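
To make "direct use of the Kudu API" concrete, here is a rough sketch of the "last 20 activities for a key" lookup using the Python client. The column name `entity_key`, the position of the timestamp column in each row tuple (`ts_index`), and the table and host names in the usage comment are all assumptions for illustration and are not taken from the original question. The function sorts client-side because it makes no assumption about scan order; a primary key that includes the timestamp would likely let you avoid reading every row for the key.

```python
from operator import itemgetter


def last_activities(table, key, limit=20, key_column="entity_key", ts_index=1):
    """Return up to `limit` rows for `key`, newest first.

    `table` is expected to behave like a kudu-python Table object.
    `key_column` and `ts_index` describe a hypothetical schema such as
    (entity_key, event_time, payload); adjust them to the real table.
    """
    scanner = table.scanner()
    # Equality predicate on the key column, evaluated by the tablet servers.
    scanner.add_predicate(table[key_column] == key)
    rows = scanner.open().read_all_tuples()
    # Scan order is not assumed, so order by timestamp on the client side.
    newest_first = sorted(rows, key=itemgetter(ts_index), reverse=True)
    return newest_first[:limit]


# Hypothetical usage against a real cluster (names are placeholders):
#   import kudu
#   client = kudu.connect(host="kudu.master", port=7051)
#   table = client.table("activities")
#   recent = last_activities(table, "user-42")
```

Each call is a single short scan issued straight to the tablet servers, which is the access pattern the benchmark numbers above refer to, as opposed to planning and executing a SQL query per lookup.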