- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Avoiding hdfs scan when querying only partition columns
- Labels:
-
Apache Impala
Created on ‎07-24-2019 04:06 PM - edited ‎09-16-2022 07:31 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
We have a slow query like:
select max(partition_col_1) from some_table where partition_col_2 = 'x'
and it's super slow, scanning all records (hundreds of billions) in the filtered partitions, even though it's actually not getting anything out of them... select only includes partitioning column. Absolutely no need to read any files I don't think.
Any way or hint to get around this?
Created ‎07-24-2019 04:28 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes! Glad you asked.
There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.
The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.
Here it is in action:
[tarmstrong-box2.ca.cloudera.com:21000] default> set OPTIMIZE_PARTITION_KEY_SCANS = 1; OPTIMIZE_PARTITION_KEY_SCANS set to 1 [tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0; Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0 +--------------------------------------------------------+ | Explain String | +--------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=0B Threads=1 | | Per-Host Resource Estimates: Memory=10MB | | Codegen disabled by planner | | | | PLAN-ROOT SINK | | | | | 01:AGGREGATE [FINALIZE] | | | output: max(ss_sold_date_sk) | | | row-size=4B cardinality=1 | | | | | 00:UNION | | constant-operands=182 | | row-size=4B cardinality=182 | +--------------------------------------------------------+
Created ‎07-24-2019 04:28 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes! Glad you asked.
There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.
The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.
Here it is in action:
[tarmstrong-box2.ca.cloudera.com:21000] default> set OPTIMIZE_PARTITION_KEY_SCANS = 1; OPTIMIZE_PARTITION_KEY_SCANS set to 1 [tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0; Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0 +--------------------------------------------------------+ | Explain String | +--------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=0B Threads=1 | | Per-Host Resource Estimates: Memory=10MB | | Codegen disabled by planner | | | | PLAN-ROOT SINK | | | | | 01:AGGREGATE [FINALIZE] | | | output: max(ss_sold_date_sk) | | | row-size=4B cardinality=1 | | | | | 00:UNION | | constant-operands=182 | | row-size=4B cardinality=182 | +--------------------------------------------------------+
Created ‎07-28-2019 08:52 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The doc says:
This optimization does not apply if the queries contain any WHERE, GROUP BY, or HAVING clause.
Do you know if it still applies if WHERE condition only has partition column as well?
Cheers
Eric
Created ‎07-29-2019 12:18 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That example does show that it works in at least one case with a where referencing a partition column. I don't know off the top of my head the exact set of cases where it works, but it does seem like the docs are not totally accurate.
Created ‎07-29-2019 03:46 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks very much Tim, I can confirm that it works like a charm, even with the group by, so yeah docs should be updated because that does add a lot of value vs. what was documented.
P.S. I didn't get an email when you first replied, only yesterday with the latest ones. thanks for the quick response.
Created ‎07-29-2019 05:12 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I filed https://issues.apache.org/jira/browse/IMPALA-8807 to fix the docs.
