Created 05-20-2018 02:24 AM
Hi guys
My question is related to the following two metrics: kudu_on_disk_size and kudu_on_disk_data_size.
I've verified those two metrics for one example tablet. The kudu_on_disk_size value makes sense, since it matches what I see with "du" on Linux. However, how is it possible that kudu_on_disk_size is, in my example, twice as big as kudu_on_disk_data_size? What kind of data is saved on disk in addition to the raw data?
A small hint regarding the data on this tablet: I'm using a schema with 21 columns, 8 of which (all integers) make up the primary key.
What I can say is that kudu_on_disk_data_size is more or less the same as the size of the same data in Parquet format. At least that makes sense to me.
Thanks in advance
Created 05-21-2018 09:23 AM
The 'data size' is just the underlying columnar data blocks, with compression.
The total 'on disk size' also includes some other structures like bloom filters (approximately 10 bits per row) as well as the synthetic composite key column. If you have 8 int64s as your primary key, this column would be about 64 bytes per row prior to compression. Depending on the cardinalities of these columns, it's quite possible that they compress poorly.
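To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. It only uses the figures mentioned above (about 10 bloom filter bits per row, 8 key columns of 8 bytes each); the function name and the row count are made up for illustration, and the result is an uncompressed estimate, not what Kudu will actually report.

```python
def estimate_overhead_bytes(num_rows, key_columns=8, bytes_per_key_column=8,
                            bloom_bits_per_row=10):
    """Rough, pre-compression estimate of the extra on-disk structures.

    Hypothetical helper for illustration only; it is not part of Kudu.
    Returns (composite_key_bytes, bloom_filter_bytes).
    """
    # Synthetic composite key column: all key columns concatenated per row.
    composite_key_bytes = num_rows * key_columns * bytes_per_key_column
    # Bloom filters: approximately 10 bits per row, converted to bytes.
    bloom_filter_bytes = num_rows * bloom_bits_per_row // 8
    return composite_key_bytes, bloom_filter_bytes


rows = 1_000_000  # assumed row count for the example tablet
key_bytes, bloom_bytes = estimate_overhead_bytes(rows)
print(f"composite key: {key_bytes} bytes, bloom filters: {bloom_bytes} bytes")
```

If the key columns are int32 rather than int64, pass bytes_per_key_column=4, which halves the composite key estimate to about 32 bytes per row.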
-Todd
Created 05-21-2018 11:50 AM