<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Avoiding hdfs scan when querying only partition columns in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93338#M57193</link>
    <description>&lt;P&gt;I filed &lt;A href="https://issues.apache.org/jira/browse/IMPALA-8807" target="_blank"&gt;https://issues.apache.org/jira/browse/IMPALA-8807&lt;/A&gt; to fix the docs.&lt;/P&gt;</description>
    <pubDate>Tue, 30 Jul 2019 00:12:16 GMT</pubDate>
    <dc:creator>Tim Armstrong</dc:creator>
    <dc:date>2019-07-30T00:12:16Z</dc:date>
    <item>
      <title>Avoiding hdfs scan when querying only partition columns</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93197#M12407</link>
      <description>&lt;P&gt;We have a slow query like:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;select max(partition_col_1) from some_table where partition_col_2 = 'x'&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;and it's super slow, scanning all records (hundreds of billions) in the filtered partitions, even though it's actually not getting anything out of them... select only includes partitioning column.&amp;nbsp; Absolutely no need to read any files I don't think.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any way or hint to get around this?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 14:31:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93197#M12407</guid>
      <dc:creator>mauricio</dc:creator>
      <dc:date>2022-09-16T14:31:57Z</dc:date>
    </item>
    <item>
      <title>Re: Avoiding hdfs scan when querying only partition columns</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93198#M12408</link>
      <description>&lt;P&gt;Yes! Glad you asked.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: &lt;A href="https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans.html" target="_blank"&gt;https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans.html&lt;/A&gt;. This converts queries like your example into a metadata-only query.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here it is in action:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;[tarmstrong-box2.ca.cloudera.com:21000] default&amp;gt; set  OPTIMIZE_PARTITION_KEY_SCANS = 1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1
[tarmstrong-box2.ca.cloudera.com:21000] default&amp;gt; explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0;
Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0
+--------------------------------------------------------+
| Explain String                                         |
+--------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=1 |
| Per-Host Resource Estimates: Memory=10MB               |
| Codegen disabled by planner                            |
|                                                        |
| PLAN-ROOT SINK                                         |
| |                                                      |
| 01:AGGREGATE [FINALIZE]                                |
| |  output: max(ss_sold_date_sk)                        |
| |  row-size=4B cardinality=1                           |
| |                                                      |
| 00:UNION                                               |
|    constant-operands=182                               |
|    row-size=4B cardinality=182                         |
+--------------------------------------------------------+&lt;/PRE&gt;</description>
      <pubDate>Wed, 24 Jul 2019 23:28:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93198#M12408</guid>
      <dc:creator>Tim Armstrong</dc:creator>
      <dc:date>2019-07-24T23:28:54Z</dc:date>
    </item>
    <item>
      <title>Re: Avoiding hdfs scan when querying only partition columns</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93305#M57173</link>
      <description>Hi Tim,&lt;BR /&gt;&lt;BR /&gt;The doc says:&lt;BR /&gt;&lt;BR /&gt;This optimization does not apply if the queries contain any WHERE, GROUP BY, or HAVING clause.&lt;BR /&gt;&lt;BR /&gt;Do you know if it still applies if WHERE condition only has partition column as well?&lt;BR /&gt;&lt;BR /&gt;Cheers&lt;BR /&gt;Eric</description>
      <pubDate>Mon, 29 Jul 2019 03:52:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93305#M57173</guid>
      <dc:creator>EricL</dc:creator>
      <dc:date>2019-07-29T03:52:51Z</dc:date>
    </item>
    <item>
      <title>Re: Avoiding hdfs scan when querying only partition columns</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93315#M57174</link>
      <description>&lt;P&gt;That example does show that it works in at least one case with a where referencing a partition column. I don't know off the top of my head the exact set of cases where it works, but it does seem like the docs are not totally accurate.&lt;/P&gt;</description>
      <pubDate>Mon, 29 Jul 2019 07:18:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93315#M57174</guid>
      <dc:creator>Tim Armstrong</dc:creator>
      <dc:date>2019-07-29T07:18:55Z</dc:date>
    </item>
    <item>
      <title>Re: Avoiding hdfs scan when querying only partition columns</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93337#M57192</link>
      <description>&lt;P&gt;Thanks very much Tim, I can confirm that it works like a charm, even with the group by, so yeah docs should be updated because that does add a lot of value vs. what was documented.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;P.S. I didn't get an email when you first replied, only yesterday with the latest ones.&amp;nbsp; thanks for the quick response.&lt;/P&gt;</description>
      <pubDate>Mon, 29 Jul 2019 22:46:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93337#M57192</guid>
      <dc:creator>mauricio</dc:creator>
      <dc:date>2019-07-29T22:46:23Z</dc:date>
    </item>
    <item>
      <title>Re: Avoiding hdfs scan when querying only partition columns</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93338#M57193</link>
      <description>&lt;P&gt;I filed &lt;A href="https://issues.apache.org/jira/browse/IMPALA-8807" target="_blank"&gt;https://issues.apache.org/jira/browse/IMPALA-8807&lt;/A&gt; to fix the docs.&lt;/P&gt;</description>
      <pubDate>Tue, 30 Jul 2019 00:12:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Avoiding-hdfs-scan-when-querying-only-partition-columns/m-p/93338#M57193</guid>
      <dc:creator>Tim Armstrong</dc:creator>
      <dc:date>2019-07-30T00:12:16Z</dc:date>
    </item>
  </channel>
</rss>

