Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Who agreed with this topic

What heuristics does Impala use for cardinality estimation

avatar
Explorer

I've got a query running that scans a table that has 17.7billion rows in it. I have some (non-partition-pruning) filters on that table.

If I look at the query plan the cardinality estimate of the scan is 1.7billion rows, exactly one tenth of the total number of rows in the table.

 

I'm simply intrigued as to why Impala estimates cardinality of the query to be exactly one tenth. What heuristics does Impala use (if any) to determine this? Or is this simply an arbitrary rule of "if there are filters on the table then make an assumption that one tenth of the data is returned".

 

Just interested to know that's all.

 

Regards

Jamie

 

P.S. here is the section of the explain plan pertinent to the scan:

 

00:SCAN HDFS [tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current, RANDOM]
   partitions=1/1 files=5184 size=1.21TB
   predicates: tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0cafw_bsk_custperpurch_52w_cnt > 0, (tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0pr0cafw_bsk_custprodperpurch_56w_cnt > 0 OR tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0pr0cafw_bsk_custprodperpurch_101w107w_cnt > 0 OR tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0pr0cafw_bsk_custprodperpurch_153w159w_cnt > 0 OR tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0pr0cafw_bsk_custprodperpurch_205w211w_cnt > 0)
   runtime filters: RF000 -> Cu0Pr0Cafw_product
   table stats: 17738533540 rows total
   column stats: all
   hosts=18 per-host-mem=5.24GB
   tuple-ids=0 row-size=612B cardinality=1773853354

 

Who agreed with this topic