<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: What heuristics does Impala use for cardinality estimation in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47923#M47042</link>
    <description>&lt;P&gt;Any chance we might one day have the ability to override DEFAULT_SELECTIVITY via a query hint?&lt;/P&gt;</description>
    <pubDate>Wed, 23 Nov 2016 22:07:53 GMT</pubDate>
    <dc:creator>jamiet</dc:creator>
    <dc:date>2016-11-23T22:07:53Z</dc:date>
    <item>
      <title>What heuristics does Impala use for cardinality estimation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47899#M47039</link>
      <description>&lt;P&gt;I've got a query running that&amp;nbsp;scans&amp;nbsp;a table that has 17.7billion rows in it. I have some (non-partition-pruning) filters on that table.&lt;/P&gt;&lt;P&gt;If I look at the query plan the cardinality estimate of the scan is 1.7billion rows, exactly one tenth of the total number of rows in the table.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm simply intrigued as to why Impala estimates cardinality of the query to be exactly one tenth. What heuristics does Impala use (if any) to determine this? Or is this simply an arbitrary rule of "if there are filters on the table then make an assumption that one tenth of the data is returned".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Just interested to know that's all.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Jamie&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;P.S. here is the section of the explain plan pertinent to the scan:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;00:SCAN HDFS [tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current, RANDOM]
   partitions=1/1 files=5184 size=1.21TB
   predicates: tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0cafw_bsk_custperpurch_52w_cnt &amp;gt; 0, (tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0pr0cafw_bsk_custprodperpurch_56w_cnt &amp;gt; 0 OR tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0pr0cafw_bsk_custprodperpurch_101w107w_cnt &amp;gt; 0 OR tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0pr0cafw_bsk_custprodperpurch_153w159w_cnt &amp;gt; 0 OR tuk_sseft.cu0pr0cafw_cu0pr1cafw_cu0cafw_pr0cafw_current.cu0pr0cafw_bsk_custprodperpurch_205w211w_cnt &amp;gt; 0)
   runtime filters: RF000 -&amp;gt; Cu0Pr0Cafw_product
   table stats: 17738533540 rows total
   column stats: all
   hosts=18 per-host-mem=5.24GB
   tuple-ids=0 row-size=612B cardinality=1773853354&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:49:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47899#M47039</guid>
      <dc:creator>jamiet</dc:creator>
      <dc:date>2022-09-16T10:49:16Z</dc:date>
    </item>
    <item>
      <title>Re: What heuristics does Impala use for cardinality estimation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47910#M47040</link>
      <description>&lt;P&gt;&amp;nbsp;You're absolutely right - we use 10% as the default estimate for selectivity for scan predicates&amp;nbsp;when we don't have a better estimate. One case where we have a better estimate is when the predicate is something like id = 100. In that case we can estimate that the selectivity is 1 / (num distinct values).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There's also some logic to handle combining the estimates when there are multiple conditions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you're curious, the code is here:&amp;nbsp;&lt;A href="https://github.com/apache/incubator-impala/blob/4db330e69a2dbb4a23f46e34b484da0d6b9ef29b/fe/src/main/java/org/apache/impala/planner/PlanNode.java#L518" target="_blank"&gt;https://github.com/apache/incubator-impala/blob/4db330e69a2dbb4a23f46e34b484da0d6b9ef29b/fe/src/main/java/org/apache/impala/planner/PlanNode.java#L518&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 23 Nov 2016 17:18:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47910#M47040</guid>
      <dc:creator>Tim Armstrong</dc:creator>
      <dc:date>2016-11-23T17:18:34Z</dc:date>
    </item>
    <item>
      <title>Re: What heuristics does Impala use for cardinality estimation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47921#M47041</link>
      <description>&lt;P&gt;Nice, thanks Tim. Appreciate the reply and, even more so, the link to the source.&lt;/P&gt;</description>
      <pubDate>Wed, 23 Nov 2016 22:05:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47921#M47041</guid>
      <dc:creator>jamiet</dc:creator>
      <dc:date>2016-11-23T22:05:17Z</dc:date>
    </item>
    <item>
      <title>Re: What heuristics does Impala use for cardinality estimation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47923#M47042</link>
      <description>&lt;P&gt;Any chance we might one day have the ability to override DEFAULT_SELECTIVITY via a query hint?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Nov 2016 22:07:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47923#M47042</guid>
      <dc:creator>jamiet</dc:creator>
      <dc:date>2016-11-23T22:07:53Z</dc:date>
    </item>
    <item>
      <title>Re: What heuristics does Impala use for cardinality estimation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47928#M47043</link>
      <description>&lt;P&gt;We had an issue filed for this a while back: &lt;A href="https://issues.cloudera.org/browse/IMPALA-3293" target="_blank"&gt;https://issues.cloudera.org/browse/IMPALA-3293&lt;/A&gt; . It seems fairly reasonable but I think will depend on how much demand there is for it (or if someone contributes a patch for it).&lt;/P&gt;</description>
      <pubDate>Wed, 23 Nov 2016 23:32:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-heuristics-does-Impala-use-for-cardinality-estimation/m-p/47928#M47043</guid>
      <dc:creator>Tim Armstrong</dc:creator>
      <dc:date>2016-11-23T23:32:55Z</dc:date>
    </item>
  </channel>
</rss>

