
Is Kudu slower than Parquet?


While running TPC-DS tests on impala+kudu vs impala+parquet, we found that for most of the queries impala+parquet is 2 to 10 times faster than impala+kudu.
Has anybody else done the same testing?

P.S. We are running Kudu 1.3.0 with CDH 5.10.


Expert Contributor
How much RAM did you give to Kudu? The default is 1G which starves it.

Please share the HW and SW specs and the results. I am quite interested. As pointed out, both could sway the results as even Impala's defaults are anemic.

Also, I want to point out that Kudu is a filesystem, Impala is an in-memory query engine. Parquet is a file format.

So what you are really comparing is Impala+Kudu vs Impala+HDFS. You should be using the same file format for both to make it a direct comparison. Also, I don't view Kudu as the inherently faster option. Yes, it is written in C, which can be faster than Java, and it is, I believe, less of an abstraction. Anyway, my point is that Kudu is great for some things and HDFS is great for others. It isn't a this-or-that choice based on performance, at least in my opinion.

We'd expect Kudu to be slower than Parquet on a pure read benchmark, but not 10x slower - that may be a configuration problem. We've published results on the Cloudera blog before that demonstrate this:


Parquet is a read-only storage format while Kudu supports row-level updates so they make different trade-offs. I think we have headroom to significantly improve the performance of both table formats in Impala over time.

E.g. in Impala 2.9/CDH5.12 IMPALA-5347 and IMPALA-5304 improve pure Parquet scan performance by 50%+ on some workloads, and I think there are probably similar opportunities for Kudu.


Cloudera Employee

@mbigelow, You've brought up a good point that HDFS is going to be strong for some workloads, while Kudu will be better for others.  It's not quite right to characterize Kudu as a file system, however.  Kudu is a distributed, columnar storage engine.  In other words, Kudu provides storage for tables, not files.  So in this case it is fair to compare Impala+Kudu to Impala+HDFS+Parquet.


Thanks all for your replies. Here are some details about the testing.

We are running impalad+kudu on 14 nodes,

nodes info:

cpu model : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

cpu cores: 32

mem: 128G

disk: 4T*12, sas


impalad and kudu are installed on each node, with 16G MEM for kudu, and 96G MEM for impalad.

The Parquet files are stored on another Hadoop cluster with 80+ nodes (running HDFS+YARN).


We are running TPC-DS queries.

Each of the 18 queries was run 3 times (3 times on impala+kudu, 3 times on impala+parquet), and then we calculated the average time. Comparing the average time of each query, we found that Kudu is slower than Parquet. Here is the result of the 18 queries:



We are planning to set up an OLAP system, so we are comparing impala+kudu vs impala+parquet to see which is the better choice.
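The per-query comparison described above (three runs per engine, averaged, then compared) can be sketched in Python as follows. The timings here are made-up placeholders, not the results from this test, and the function name `slowdown` is only illustrative.

```python
from statistics import mean

def slowdown(kudu_runs, parquet_runs):
    """Average each engine's run times (seconds) and return the Kudu/Parquet ratio."""
    kudu_avg = mean(kudu_runs)
    parquet_avg = mean(parquet_runs)
    return kudu_avg, parquet_avg, kudu_avg / parquet_avg

# Hypothetical timings for one query, three runs per engine (NOT the thread's data).
kudu_avg, parquet_avg, ratio = slowdown([12.0, 11.0, 13.0], [4.0, 4.5, 3.5])
print(f"kudu avg={kudu_avg:.1f}s parquet avg={parquet_avg:.1f}s ratio={ratio:.1f}x")
```

With only three runs per query, it may also be worth looking at the spread between runs, since a single cold-cache run can dominate the average.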

Expert Contributor

Make sure you run COMPUTE STATS after loading the data so that Impala knows how to join the Kudu tables.


What is the total size of your data set?


I am surprised at the difference in your numbers and I think they should be closer if tuned correctly. Regardless, if you don't need to be able to do online inserts and updates, then Kudu won't buy you much over the raw scan speed of an immutable on-disk format like Impala + Parquet on HDFS.



Expert Contributor

Can you also share how you partitioned your Kudu table?


1. Make sure you run COMPUTE STATS: yes, we do this after loading the data.


2. What is the total size of your data set?

The Impala TPC-DS tool creates 9 dimension tables and 1 fact table.

The dimension tables are small (from 1k to 4 million+ records, depending on the generated data size),

and the fact table is big. Here is the 'data size --> record count' of the fact table:





3. Can you also share how you partitioned your Kudu table?

For the dimension tables, we hash partition each one into 2 partitions by its primary key (the Parquet tables are not partitioned).

For the fact table, we range partition it into 60 partitions by its 'data field' (the Parquet table is partitioned into 1800+ partitions).

All tables created in Kudu have a replication factor of 3.
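For anyone who wants to reproduce a layout like this, here is a minimal Python sketch that builds an Impala CREATE TABLE statement for a range-partitioned Kudu table. The table name, column names, and split points are all hypothetical (the actual schema is not given in this thread), and the generated statement should be checked against the Impala and Kudu versions in use.

```python
def kudu_range_table_ddl(table, range_col, key_col, bounds, replicas=3):
    """Build Impala DDL for a Kudu table with one range partition per
    consecutive pair of split points in `bounds` (N+1 bounds -> N partitions)."""
    ranges = [
        f"  PARTITION {lo} <= VALUES < {hi}"
        for lo, hi in zip(bounds, bounds[1:])
    ]
    lines = [
        f"CREATE TABLE {table} (",
        f"  {range_col} BIGINT,",
        f"  {key_col} BIGINT,",
        # Kudu range-partition columns must be part of the primary key.
        f"  PRIMARY KEY ({range_col}, {key_col})",
        ")",
        f"PARTITION BY RANGE ({range_col}) (",
        ",\n".join(ranges),
        ")",
        "STORED AS KUDU",
        f"TBLPROPERTIES ('kudu.num_tablet_replicas' = '{replicas}');",
    ]
    return "\n".join(lines)

# Hypothetical names and split points, for illustration only.
ddl = kudu_range_table_ddl("fact_demo", "sold_date_id", "item_id", [0, 100, 200, 300])
print(ddl)
```

Generating the statement from a list of split points makes it easy to change the partition count between test runs without hand-editing the DDL.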

Expert Contributor
Could you check whether you are under the current scale recommendations for

We are working hard on increasing these limits and will try to do so with each coming release.

Current scale limits for CDH 5.11 (Kudu 1.3):

Expert Contributor

If you are under the scale limits consider increasing # of partitions. Impala tends to use one thread per partition when scanning.

This is a good suggestion. We are under the scale limits.
We may run another test at a later time, e.g. with an increased # of partitions.

Expert Contributor

Impala relies heavily on parallelism for throughput, so if you have 60 partitions for Kudu and 1800 partitions for Parquet, then due to Impala's current single-thread-per-partition limitation you have built a huge disadvantage for Kudu into this comparison.
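A rough back-of-the-envelope sketch of that effect in Python, using the 14 nodes and 32 cores per node reported earlier in the thread. It assumes one scanner thread per partition and an even spread of partitions across nodes, and it ignores replication, partition pruning, and everything else a real query plan does, so treat it as an upper bound on scan parallelism rather than a prediction.

```python
def scan_parallelism(partitions, nodes, cores_per_node):
    """Return (partitions per node, fraction of cores that scans can keep busy),
    assuming one scanner thread per partition spread evenly over the nodes."""
    per_node = partitions / nodes
    busy_fraction = min(per_node, cores_per_node) / cores_per_node
    return per_node, busy_fraction

for label, partitions in [("60 partitions", 60), ("768 partitions", 768)]:
    per_node, busy = scan_parallelism(partitions, nodes=14, cores_per_node=32)
    print(f"{label}: {per_node:.1f} per node, up to {busy:.0%} of cores scanning")
```

Under these assumptions, 60 partitions gives roughly 4.3 scanner threads per node, which can occupy only about 13% of 32 cores, while 768 partitions gives about 55 per node, enough to keep every core busy.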


Please let us know if you re-run your comparison test.


I have re-run the test, and Kudu performs much better this time (though it is still a little slower than Parquet). Thanks to @mpercy for the suggestion.

I changed two things when re-running the test:

1. Increased the partitions of the fact table from 60 to 768 (affects all queries).

2. Changed the 'or' predicate in query3.sql into an 'in' predicate, so that the predicate can be pushed down to Kudu (affects only query 3).
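To illustrate the shape of the second change, here is a small Python sketch that renders the same filter both ways. The column name and values are hypothetical and are not taken from query3.sql; the point is only that a chain of equality tests on one column can be written as a single IN list, which is the form reported above as being pushed down to Kudu.

```python
def as_or_chain(column, values):
    """Render equality tests on one column joined with OR."""
    return " OR ".join(f"{column} = {v}" for v in values)

def as_in_list(column, values):
    """Render the same tests as a single IN-list predicate."""
    rendered = ", ".join(str(v) for v in values)
    return f"{column} IN ({rendered})"

# Hypothetical column and values, for illustration only.
before = as_or_chain("some_col", [11, 12])
after = as_in_list("some_col", [11, 12])
print("before:", before)
print("after: ", after)
```

The two predicates select the same rows; whether a given form is pushed down to the storage layer depends on the Impala and Kudu versions, and the query profile or EXPLAIN output shows which predicates Kudu actually receives.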


Below is the re-run result:

(Column 'kudu60' is the previous result, with 60 partitions for the fact table.)

(Column 'kudu768' is the new result, with 768 partitions for the fact table.)