Support Questions

mauricio · ‎07-24-2019

We have a slow query like:

select max(partition_col_1) from some_table where partition_col_2 = 'x'

and it's super slow, scanning all records (hundreds of billions) in the filtered partitions, even though it's actually not getting anything out of them... select only includes partitioning column. Absolutely no need to read any files I don't think.

Any way or hint to get around this?

Tim Armstrong · ‎07-24-2019

Yes! Glad you asked.

There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.

The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.

Here it is in action:

[tarmstrong-box2.ca.cloudera.com:21000] default> set  OPTIMIZE_PARTITION_KEY_SCANS = 1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1
[tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0;
Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0
+--------------------------------------------------------+
| Explain String                                         |
+--------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=1 |
| Per-Host Resource Estimates: Memory=10MB               |
| Codegen disabled by planner                            |
|                                                        |
| PLAN-ROOT SINK                                         |
| |                                                      |
| 01:AGGREGATE [FINALIZE]                                |
| |  output: max(ss_sold_date_sk)                        |
| |  row-size=4B cardinality=1                           |
| |                                                      |
| 00:UNION                                               |
|    constant-operands=182                               |
|    row-size=4B cardinality=182                         |
+--------------------------------------------------------+

View solution in original post

Tim Armstrong · ‎07-24-2019

Yes! Glad you asked.

There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.

The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.

Here it is in action:

[tarmstrong-box2.ca.cloudera.com:21000] default> set  OPTIMIZE_PARTITION_KEY_SCANS = 1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1
[tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0;
Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0
+--------------------------------------------------------+
| Explain String                                         |
+--------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=1 |
| Per-Host Resource Estimates: Memory=10MB               |
| Codegen disabled by planner                            |
|                                                        |
| PLAN-ROOT SINK                                         |
| |                                                      |
| 01:AGGREGATE [FINALIZE]                                |
| |  output: max(ss_sold_date_sk)                        |
| |  row-size=4B cardinality=1                           |
| |                                                      |
| 00:UNION                                               |
|    constant-operands=182                               |
|    row-size=4B cardinality=182                         |
+--------------------------------------------------------+

EricL · ‎07-28-2019

Hi Tim,

The doc says:

This optimization does not apply if the queries contain any WHERE, GROUP BY, or HAVING clause.

Do you know if it still applies if WHERE condition only has partition column as well?

Cheers
Eric

Tim Armstrong · ‎07-29-2019

That example does show that it works in at least one case with a where referencing a partition column. I don't know off the top of my head the exact set of cases where it works, but it does seem like the docs are not totally accurate.

mauricio · ‎07-29-2019

Thanks very much Tim, I can confirm that it works like a charm, even with the group by, so yeah docs should be updated because that does add a lot of value vs. what was documented.

P.S. I didn't get an email when you first replied, only yesterday with the latest ones. thanks for the quick response.

Tim Armstrong · ‎07-29-2019

I filed https://issues.apache.org/jira/browse/IMPALA-8807 to fix the docs.

Cloudera Community

Support Questions

Avoiding hdfs scan when querying only partition columns

Adding new columns to an already partitioned Hive ...

Scaling the HDFS NameNode (part 4) - Avoiding Perf...

Excluding HDFS from antivirus scans

How to avoid full table scan while running hbase c...

Hive partition - partition column as part of the d...

Spark to support REGEX column specification for Hi...

query on partitioning

Help diagnosing slow query (even though fast hdfs ...

hive Insert to Dynamic Partition query Generating ...

Excluding CDP components from Antivirus scans