Created on 01-29-2018 10:47 AM - edited 09-16-2022 05:48 AM
Hi,
I thought that during a scan, Kudu reports the number of rows read in real time for each column. At least on a small table, it was roughly equal to the number of rows in the partition.
But now I am scanning a table with 1 billion (1 000 000 000) rows, split into multiple partitions, and the cells read show 2.3 billion, 4.6 billion, etc.
Can somebody explain why those numbers are so high?
Created 01-31-2018 06:31 AM
Found out that if multiple Spark tasks read the same tablet (partition), the reads are counted once per task. The total cells read can therefore be much higher than the number of rows in the tablet: roughly # of tasks x # of rows.
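As a rough sanity check of that explanation, the arithmetic can be sketched in Python. The function names here are made up for illustration and are not part of Kudu or Spark, and the sketch assumes every task scans the whole tablet, which may not hold exactly in practice.

```python
def expected_cells_read(rows_in_tablet: int, tasks_scanning_tablet: int) -> int:
    """Per-column cells read, ASSUMING each task scans the full tablet."""
    return rows_in_tablet * tasks_scanning_tablet


def implied_scan_passes(cells_read: int, rows_in_table: int) -> float:
    """How many times, on average, each row appears to have been read."""
    return cells_read / rows_in_table


if __name__ == "__main__":
    # Hypothetical tablet: 1 000 000 rows scanned by 3 tasks.
    print(expected_cells_read(1_000_000, 3))

    # Counters from the question: 2.3 billion and 4.6 billion cells
    # reported for a 1 billion row table.
    for cells in (2_300_000_000, 4_600_000_000):
        print(cells, implied_scan_passes(cells, 1_000_000_000))
```

Dividing the reported cells read by the table's row count gives an estimate of how many times the data was scanned on average, which can be compared against the number of Spark tasks in the job.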