Explorer
Posts: 13
Registered: ‎05-19-2018

Kudu Size on Disk Compared to Parquet

Hi guys

We have run some tests comparing Kudu with Parquet. The Parquet data was about 170 GB in total. Our issue is that Kudu uses about twice as much disk space as Parquet (without any replication). We measured the size of the data folder on disk with "du". The WAL was in a different folder, so it wasn't included.
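For anyone who wants to reproduce the measurement without "du", here is a minimal Python sketch. It assumes a Linux/POSIX filesystem; the function name dir_sizes and the example path are made up for illustration. It reports both the apparent size (sum of file lengths) and the allocated size (blocks actually used, which is roughly what "du" reports by default), since the two can differ a lot when files are sparse.

```python
import os

def dir_sizes(root):
    """Return (apparent_bytes, allocated_bytes) for all regular files under root.

    apparent_bytes  ~ what `du --apparent-size -sb` would report
    allocated_bytes ~ what plain `du -s` would report (in bytes)
    Hard-linked files are counted once per link, unlike `du`.
    """
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            apparent += st.st_size
            # st_blocks is in 512-byte units on POSIX; absent on Windows.
            allocated += getattr(st, "st_blocks", 0) * 512
    return apparent, allocated

if __name__ == "__main__":
    # Hypothetical path; replace with the actual Kudu data directory.
    data_dir = "/path/to/kudu/data"
    if os.path.isdir(data_dir):
        a, b = dir_sizes(data_dir)
        print(f"apparent: {a / 1024**3:.1f} GiB, allocated: {b / 1024**3:.1f} GiB")
```

Running it once on the data folder and once on the WAL folder keeps the two numbers separate, as in the measurement described above.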

Below is the schema for our table. Columns 0-7 make up the primary key, and we can't change that because they are needed for uniqueness.

We are working with Kudu 1.6.0.

Any ideas why Kudu uses twice as much disk space as Parquet? Or is this expected behavior? We created about 2400 tablets distributed over 4 servers.

Cheers

image001.png

Explorer
Posts: 13
Registered: ‎05-19-2018

Re: Kudu Size on Disk Compared to Parquet

I've checked some Kudu metrics and found that the metric "kudu_on_disk_data_size" shows more or less the same size as the Parquet files. However, the "kudu_on_disk_size" metric corresponds to the size measured on disk. I've created a new thread to discuss these two Kudu metrics. I hope somebody can explain the difference.

New thread:
https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Kudu-Metrics-kudu-on-disk-data-size-am...
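To put a number on the gap between the two metrics, something like the following rough sketch could be used. The function name storage_overhead is made up, and the values in the example are round illustrative numbers based on the sizes mentioned in this thread, not actual metric readings.

```python
def storage_overhead(on_disk_size, on_disk_data_size):
    """Return (extra_bytes, ratio) between total on-disk size and data-only size.

    Both arguments are byte counts, e.g. values read from the
    kudu_on_disk_size and kudu_on_disk_data_size metrics.
    """
    if on_disk_data_size <= 0:
        raise ValueError("on_disk_data_size must be positive")
    extra = on_disk_size - on_disk_data_size
    ratio = on_disk_size / on_disk_data_size
    return extra, ratio

if __name__ == "__main__":
    GIB = 1024 ** 3
    # Illustrative values only: data size close to the Parquet size (~170 GB),
    # total size about twice that.
    extra, ratio = storage_overhead(340 * GIB, 170 * GIB)
    print(f"extra: {extra / GIB:.0f} GiB, ratio: {ratio:.1f}x")
```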

Cloudera Employee
Posts: 70
Registered: ‎04-08-2014

Re: Kudu Size on Disk Compared to Parquet

I think Todd answered your question in the other thread pretty well. To support its online indexed performance, Kudu stores additional data structures that Parquet doesn't have, including row indexes and bloom filters, which require additional space on top of what Parquet requires.

The kudu_on_disk_size metric also includes the size of the WAL and other metadata files like the tablet superblock and the consensus metadata (although those last two are usually relatively small).