Created on 07-24-2019 04:06 PM - edited 09-16-2022 07:31 AM
We have a slow query like:
select max(partition_col_1) from some_table where partition_col_2 = 'x'
and it's super slow, scanning all records (hundreds of billions) in the filtered partitions, even though it's actually not getting anything out of them... select only includes partitioning column. Absolutely no need to read any files I don't think.
Any way or hint to get around this?
Created 07-24-2019 04:28 PM
Yes! Glad you asked.
There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.
The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.
Here it is in action:
[tarmstrong-box2.ca.cloudera.com:21000] default> set OPTIMIZE_PARTITION_KEY_SCANS = 1; OPTIMIZE_PARTITION_KEY_SCANS set to 1 [tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0; Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0 +--------------------------------------------------------+ | Explain String | +--------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=0B Threads=1 | | Per-Host Resource Estimates: Memory=10MB | | Codegen disabled by planner | | | | PLAN-ROOT SINK | | | | | 01:AGGREGATE [FINALIZE] | | | output: max(ss_sold_date_sk) | | | row-size=4B cardinality=1 | | | | | 00:UNION | | constant-operands=182 | | row-size=4B cardinality=182 | +--------------------------------------------------------+
Created 07-24-2019 04:28 PM
Yes! Glad you asked.
There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.
The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.
Here it is in action:
[tarmstrong-box2.ca.cloudera.com:21000] default> set OPTIMIZE_PARTITION_KEY_SCANS = 1; OPTIMIZE_PARTITION_KEY_SCANS set to 1 [tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0; Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0 +--------------------------------------------------------+ | Explain String | +--------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=0B Threads=1 | | Per-Host Resource Estimates: Memory=10MB | | Codegen disabled by planner | | | | PLAN-ROOT SINK | | | | | 01:AGGREGATE [FINALIZE] | | | output: max(ss_sold_date_sk) | | | row-size=4B cardinality=1 | | | | | 00:UNION | | constant-operands=182 | | row-size=4B cardinality=182 | +--------------------------------------------------------+
Created 07-28-2019 08:52 PM
Created 07-29-2019 12:18 AM
That example does show that it works in at least one case with a where referencing a partition column. I don't know off the top of my head the exact set of cases where it works, but it does seem like the docs are not totally accurate.
Created 07-29-2019 03:46 PM
Thanks very much Tim, I can confirm that it works like a charm, even with the group by, so yeah docs should be updated because that does add a lot of value vs. what was documented.
P.S. I didn't get an email when you first replied, only yesterday with the latest ones. thanks for the quick response.
Created 07-29-2019 05:12 PM
I filed https://issues.apache.org/jira/browse/IMPALA-8807 to fix the docs.