New Contributor
Posts: 6
Registered: ‎05-19-2018

Kudu Metrics: kudu_on_disk_data_size & kudu_on_disk_size

Hi guys

 

My question is related to the following two metrics: 

  • kudu_on_disk_data_size [Space used by this tablet's data blocks.] -> 1494 MB
  • kudu_on_disk_size [Size of this tablet on disk.] -> 3010 MB

 

I've verified both metrics for one example tablet. kudu_on_disk_size makes sense, since it matches what I see with "du" on Linux. But how can kudu_on_disk_size be twice as large as kudu_on_disk_data_size in my example? What else is stored on disk besides the raw data?

 

A small hint about the data on this tablet: the schema has 21 columns, 8 of which (all integers) make up the primary key.

 

What I can say is that kudu_on_disk_data_size is more or less the same as the size of the same data in Parquet format. At least that makes sense to me.

 

Thanks in advance

 

Cloudera Employee
Posts: 62
Registered: ‎09-28-2015

Re: Kudu Metrics: kudu_on_disk_data_size & kudu_on_disk_size

The 'data size' is just the underlying columnar data blocks, with compression.

 

The total 'on disk size' is inclusive of some other structures like bloom filters (approximately 10 bits per row) as well as the synthetic composite key column. If you have 8 int64s as your primary key, this column would be about 64 bytes per row prior to compression. Depending on the cardinalities of these columns it's quite possible that they compress poorly.

 

-Todd

New Contributor
Posts: 6
Registered: ‎05-19-2018

Re: Kudu Metrics: kudu_on_disk_data_size & kudu_on_disk_size

Thanks Todd! So the fact that we are using an 8-column primary key (int32, by the way, so 32 bytes per row) seems to cause this large amount of metadata. To sum up, the 32 bytes plus approx. 10 bits per row seem to account for the gap between data_size and on_disk_size. Am I correct?

Is the composite key column compressed? I thought there was no compression at all...
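
For anyone who wants to redo the arithmetic from this thread, here is a rough back-of-envelope sketch in Python. It only models the two structures Todd mentioned (the composite key column and roughly 10 bits of bloom filter per row) and treats both as uncompressed, so it gives an upper bound for those two items, not an exact breakdown of kudu_on_disk_size. The function name, its parameters, and the row count are hypothetical; the thread does not state how many rows the tablet holds.

```python
def estimate_overhead_bytes(num_rows, key_columns=8, bytes_per_key_column=4,
                            bloom_bits_per_row=10):
    """Uncompressed size of the composite key column plus bloom filters.

    Assumptions (not confirmed by the thread): the composite key costs
    key_columns * bytes_per_key_column bytes per row before compression,
    and the bloom filters cost bloom_bits_per_row bits per row.
    """
    composite_key_bytes = num_rows * key_columns * bytes_per_key_column
    bloom_bytes = num_rows * bloom_bits_per_row / 8
    return composite_key_bytes + bloom_bytes


# Values reported in the first post, in MB.
data_size_mb = 1494
on_disk_size_mb = 3010
gap_mb = on_disk_size_mb - data_size_mb
print(f"gap between the two metrics: {gap_mb} MB")

# Hypothetical row count, only to show how the estimate is used.
example_rows = 10_000_000
overhead_mb = estimate_overhead_bytes(example_rows) / (1024 * 1024)
print(f"estimated overhead for {example_rows} rows: {overhead_mb:.1f} MB")
```

Plugging in the real row count of the tablet and comparing the result with the gap shows how much of the difference these two structures could explain. Passing bytes_per_key_column=8 reproduces the 64 bytes per row from Todd's int64 example.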