05-19-2018 03:02 PM - edited 05-19-2018 03:03 PM
We ran some tests comparing Kudu with Parquet. The Parquet data totaled about 170 GB. Our issue is that Kudu uses about twice as much disk space as Parquet (without any replication). We measured the size of the data folder on disk with "du". The WAL was in a different folder, so it wasn't included.
Below is the schema for our table. Columns 0-7 make up the primary key, and we can't change that because of the uniqueness requirement.
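For anyone who wants to repeat the measurement from a script, here is a minimal sketch of a "du"-style total in Python. The function name `dir_usage_bytes` and its `apparent` flag are made up for illustration, and it assumes a Unix-like system where `st_blocks` is reported in 512-byte units. It counts allocated blocks by default because the two numbers can differ a lot for sparse files, and it skips the space taken by directory entries themselves, so it may come out slightly below what "du" prints.

```python
import os

def dir_usage_bytes(root, apparent=False):
    """Sum the space used by non-directory entries under root, roughly like `du -s`.

    Hypothetical helper, not part of Kudu. With apparent=False it counts
    allocated 512-byte blocks; with apparent=True it counts logical file sizes
    (similar to `du --apparent-size`).
    """
    seen = set()  # (device, inode) pairs, so hard-linked files are counted once
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except FileNotFoundError:
                # The file disappeared while we were walking a live directory.
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size if apparent else st.st_blocks * 512
    return total
```

Calling `dir_usage_bytes("/path/to/kudu/data")` (the path is a placeholder) returns a byte count that can be compared against the total size of the Parquet files.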
We are working with Kudu 1.6.0.
Any ideas why Kudu uses twice as much disk space as Parquet? Or is this expected behavior? We created about 2400 tablets distributed over 4 servers.
05-20-2018 02:34 AM - edited 05-20-2018 02:35 AM
I checked some Kudu metrics and found that the metric "kudu_on_disk_data_size" shows more or less the same size as the Parquet files. However, the "kudu_on_disk_size" metric correlates with the size on disk. I've created a new thread to discuss those two Kudu metrics. I hope somebody can explain the difference.
05-21-2018 04:18 PM
I think Todd answered your question in the other thread pretty well. To support its online indexed performance, Kudu stores additional data structures that Parquet doesn't have, including row indexes and bloom filters. These require additional space on top of what Parquet requires.
The kudu_on_disk_size metric also includes the size of the WAL and other metadata files like the tablet superblock and the consensus metadata (although those last two are usually relatively small).
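To put a number on the gap between the two metrics, a rough sketch like the following could sum them across tablets and take the difference. It assumes the metrics have already been loaded as a list of entity dictionaries, each with a "type" and a "metrics" list of name/value pairs, and that the per-tablet metric names are "on_disk_size" and "on_disk_data_size" (the thread quotes them with a "kudu_" prefix). Both the input shape and the helper names are assumptions for illustration, so adjust them to whatever your monitoring export actually contains.

```python
def sum_tablet_metrics(entities, names=("on_disk_size", "on_disk_data_size")):
    """Add up the named metrics over all entities whose type is "tablet".

    `entities` is assumed to look like:
    [{"type": "tablet", "id": "...", "metrics": [{"name": ..., "value": ...}]}]
    """
    totals = {name: 0 for name in names}
    for entity in entities:
        if entity.get("type") != "tablet":
            continue
        for metric in entity.get("metrics", []):
            if metric.get("name") in totals:
                totals[metric["name"]] += metric.get("value", 0)
    return totals


def overhead_bytes(totals):
    """Bytes counted by on_disk_size but not by on_disk_data_size."""
    return totals["on_disk_size"] - totals["on_disk_data_size"]


# Made-up sample values, only to show the arithmetic.
sample = [
    {"type": "tablet", "id": "t1",
     "metrics": [{"name": "on_disk_size", "value": 300},
                 {"name": "on_disk_data_size", "value": 140}]},
    {"type": "tablet", "id": "t2",
     "metrics": [{"name": "on_disk_size", "value": 200},
                 {"name": "on_disk_data_size", "value": 110}]},
    {"type": "server", "id": "s1", "metrics": []},
]
totals = sum_tablet_metrics(sample)
print(totals, overhead_bytes(totals))
```

With about 2400 tablets, summing per tablet and then comparing the two totals is one way to see how much of the extra space falls outside the data that the Parquet comparison is based on.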