Is Kudu slower than Parquet?

Explorer

While running TPC-DS tests on Impala+Kudu versus Impala+Parquet (following https://github.com/cloudera/impala-tpcds-kit), we found that for most queries Impala+Parquet is 2 to 10 times faster than Impala+Kudu.
Has anybody else run the same test?


P.S. We are running Kudu 1.3.0 with CDH 5.10.


Expert Contributor

If you are under the scale limits, consider increasing the number of partitions. Impala tends to use one thread per partition when scanning.

Explorer
This is a good suggestion; we are under the scale limits.
We may run another test later, e.g. with an increased number of partitions.

Expert Contributor

Impala relies heavily on parallelism for throughput, so if you have 60 partitions for Kudu and 1800 partitions for Parquet, Impala's current single-thread-per-partition limitation builds a huge disadvantage for Kudu into this comparison.
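
For anyone who wants to try a higher partition count, here is a minimal sketch that generates the DDL for a hash-partitioned Kudu table. The helper `kudu_create_table_sql`, the table name `store_sales_demo`, and the trimmed column list are assumptions made up for illustration; they are not the schema from the impala-tpcds-kit. Only the `PARTITION BY HASH (...) PARTITIONS n STORED AS KUDU` clause is actual Impala syntax.

```python
def kudu_create_table_sql(table, columns, primary_key, hash_columns, num_partitions):
    """Build an Impala CREATE TABLE statement for a hash-partitioned Kudu table.

    columns is a list of (name, sql_type) pairs; primary-key columns should
    come first in that list, as Impala expects for Kudu tables.
    """
    # Kudu only allows hash partitioning on primary-key columns.
    missing = [c for c in hash_columns if c not in primary_key]
    if missing:
        raise ValueError(f"hash columns must be part of the primary key: {missing}")
    # Kudu requires at least two hash buckets.
    if num_partitions < 2:
        raise ValueError("hash partitioning needs at least 2 partitions")

    col_defs = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n"
        f"  {col_defs},\n"
        f"  PRIMARY KEY ({', '.join(primary_key)})\n"
        f")\n"
        f"PARTITION BY HASH ({', '.join(hash_columns)}) PARTITIONS {num_partitions}\n"
        f"STORED AS KUDU"
    )


# Hypothetical, trimmed-down fact table; the column list is illustrative only.
demo_columns = [
    ("ss_ticket_number", "BIGINT"),
    ("ss_item_sk", "BIGINT"),
    ("ss_sold_date_sk", "BIGINT"),
    ("ss_quantity", "INT"),
]
print(kudu_create_table_sql(
    "store_sales_demo",
    demo_columns,
    primary_key=["ss_ticket_number", "ss_item_sk"],
    hash_columns=["ss_ticket_number", "ss_item_sk"],
    num_partitions=768,
))
```

The partition count is a parameter so the same statement can be regenerated for different values when comparing runs. A suitable value depends on cluster size and the Kudu scale limits mentioned above.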

 

Please let us know if you re-run your comparison test.

Explorer

I have re-run the test, and Kudu performs much better this time (though it is still a little slower than Parquet). Thanks to @mpercy for the suggestion.

I changed two things when re-running the test:

1. Increased the number of partitions of the fact table from 60 to 768 (affects all queries).

2. Changed the 'or' predicate in query3.sql into an 'in' predicate, so that the predicate can be pushed down to Kudu (affects only query 3).
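
To illustrate the second change, here is a small sketch of the rewrite: a chain of equality tests on one column joined by OR is logically equivalent to a single IN list. The function names, the column `i_manufact_id`, and the values are placeholders for illustration, not the actual predicate from query3.sql, and the sketch assumes numeric literals only.

```python
def or_chain(column, values):
    """Render `column = v1 OR column = v2 ...` (the original predicate form)."""
    return " OR ".join(f"{column} = {v}" for v in values)


def or_chain_to_in(column, values):
    """Render the equivalent `column IN (v1, v2, ...)` predicate.

    Assumes numeric literals; string values would need quoting and escaping.
    """
    if not values:
        raise ValueError("need at least one value")
    # Drop duplicate values while keeping their original order.
    unique = list(dict.fromkeys(values))
    return f"{column} IN ({', '.join(str(v) for v in unique)})"


# Placeholder column and values, not taken from the real query.
print(or_chain("i_manufact_id", [128, 129]))
print(or_chain_to_in("i_manufact_id", [128, 129]))
```

Both forms select the same rows. The difference reported here is only in whether the predicate gets pushed down to Kudu.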

 

Below is the re-run result:

(Column 'kudu60' is the previous result, with 60 partitions for the fact table.)

(Column 'kudu768' is the new result, with 768 partitions for the fact table.)

kudu-parquet2.png