Kudu Metrics: kudu_on_disk_data_size & kudu_on_disk_size

Explorer

Hi guys

 

My question is related to the following two metrics: 

  • kudu_on_disk_data_size [Space used by this tablet's data blocks.] -> 1494 MB
  • kudu_on_disk_size [Size of this tablet on disk.] -> 3010 MB

 

I've checked these two metrics for one example tablet. The kudu_on_disk_size value makes sense to me, since it matches what "du" reports on Linux. But how can kudu_on_disk_size be twice as large as kudu_on_disk_data_size in my example? What else is stored on disk besides the raw data?

 

A small hint about the data in this tablet: the schema has 21 columns, and the primary key is made up of 8 of them (all integers).

 

What I can say is that kudu_on_disk_data_size is more or less the same as the size of the same data in Parquet format. That part, at least, makes sense to me.

 

Thanks in advance

 

2 REPLIES

Re: Kudu Metrics: kudu_on_disk_data_size & kudu_on_disk_size

Expert Contributor

The 'data size' is just the underlying columnar data blocks, with compression.

 

The total 'on disk size' is inclusive of some other structures like bloom filters (approximately 10 bits per row) as well as the synthetic composite key column. If you have 8 int64s as your primary key, this column would be about 64 bytes per row prior to compression. Depending on the cardinalities of these columns it's quite possible that they compress poorly.
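The per-row arithmetic above can be sketched in a few lines of Python. This is only a rough illustration: the function name is hypothetical and not part of any Kudu API, the 10 bits per row is the approximate bloom filter figure quoted above, and the result is the size before any compression.

```python
def per_row_overhead_bytes(key_column_bytes, bloom_bits_per_row=10):
    """Estimate uncompressed per-row overhead in bytes.

    key_column_bytes: width in bytes of each primary key column
    bloom_bits_per_row: approximate bloom filter cost (assumed ~10 bits)
    """
    # The synthetic composite key holds all key columns together.
    composite_key = sum(key_column_bytes)
    # Convert the bloom filter cost from bits to bytes.
    bloom = bloom_bits_per_row / 8
    return composite_key + bloom


# Eight int64 key columns: 64 bytes of key plus 1.25 bytes of bloom filter.
print(per_row_overhead_bytes([8] * 8))
```

With eight 8-byte key columns this gives 65.25 bytes per row before compression; narrower key columns shrink the composite key proportionally.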

 

-Todd

Re: Kudu Metrics: kudu_on_disk_data_size & kudu_on_disk_size

Explorer
Thanks Todd! So the fact that we are using 8 key columns (int32, by the way, so 32 bytes per row) seems to cause this huge amount of metadata. To sum up, the 32 bytes plus approx. 10 bits per row seem to account for the gap between data_size and on_disk_size... Am I correct?

Is the composite key column compressed at all? I thought there was no compression for it...
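One way to sanity-check this is to work out how many rows the tablet would need for that per-row overhead to fill the whole gap. The sketch below is only an estimate under loud assumptions: the reported "MB" are treated as MiB, the composite key is taken as completely uncompressed, and nothing else is assumed to contribute to the difference. The helper name is made up for illustration.

```python
MIB = 1024 * 1024  # assumption: the metric values quoted in MB are MiB


def rows_needed_to_explain_gap(on_disk_mb, data_mb, overhead_bytes_per_row):
    """Rows required for a fixed per-row overhead to fill the size gap."""
    gap_bytes = (on_disk_mb - data_mb) * MIB
    return gap_bytes / overhead_bytes_per_row


# Values from the original post: 3010 MB on disk, 1494 MB of data blocks.
# Overhead: 8 int32 key columns (32 bytes) plus ~10 bloom filter bits.
overhead = 8 * 4 + 10 / 8
rows = rows_needed_to_explain_gap(3010, 1494, overhead)
print(f"gap: {3010 - 1494} MB, rows needed: {rows:,.0f}")
```

Under these assumptions the gap of 1516 MB corresponds to roughly 47.8 million rows. If the tablet's real row count is well below that, the composite key and bloom filters alone would not explain the difference.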