Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Avoiding hdfs scan when querying only partition columns

avatar
Rising Star

We have a slow query like:

 

select max(partition_col_1) from some_table where partition_col_2 = 'x'

 

and it's super slow, scanning all records (hundreds of billions) in the filtered partitions, even though it's actually not getting anything out of them... select only includes partitioning column.  Absolutely no need to read any files I don't think.

 

Any way or hint to get around this?

1 ACCEPTED SOLUTION

avatar

Yes! Glad you asked.

 

There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.

 

The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.

 

Here it is in action:

 

[tarmstrong-box2.ca.cloudera.com:21000] default> set  OPTIMIZE_PARTITION_KEY_SCANS = 1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1
[tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0;
Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0
+--------------------------------------------------------+
| Explain String                                         |
+--------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=1 |
| Per-Host Resource Estimates: Memory=10MB               |
| Codegen disabled by planner                            |
|                                                        |
| PLAN-ROOT SINK                                         |
| |                                                      |
| 01:AGGREGATE [FINALIZE]                                |
| |  output: max(ss_sold_date_sk)                        |
| |  row-size=4B cardinality=1                           |
| |                                                      |
| 00:UNION                                               |
|    constant-operands=182                               |
|    row-size=4B cardinality=182                         |
+--------------------------------------------------------+

View solution in original post

5 REPLIES 5

avatar

Yes! Glad you asked.

 

There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.

 

The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.

 

Here it is in action:

 

[tarmstrong-box2.ca.cloudera.com:21000] default> set  OPTIMIZE_PARTITION_KEY_SCANS = 1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1
[tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0;
Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0
+--------------------------------------------------------+
| Explain String                                         |
+--------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=1 |
| Per-Host Resource Estimates: Memory=10MB               |
| Codegen disabled by planner                            |
|                                                        |
| PLAN-ROOT SINK                                         |
| |                                                      |
| 01:AGGREGATE [FINALIZE]                                |
| |  output: max(ss_sold_date_sk)                        |
| |  row-size=4B cardinality=1                           |
| |                                                      |
| 00:UNION                                               |
|    constant-operands=182                               |
|    row-size=4B cardinality=182                         |
+--------------------------------------------------------+

avatar
Super Guru
Hi Tim,

The doc says:

This optimization does not apply if the queries contain any WHERE, GROUP BY, or HAVING clause.

Do you know if it still applies if WHERE condition only has partition column as well?

Cheers
Eric

avatar

That example does show that it works in at least one case with a where referencing a partition column. I don't know off the top of my head the exact set of cases where it works, but it does seem like the docs are not totally accurate.

avatar
Rising Star

Thanks very much Tim, I can confirm that it works like a charm, even with the group by, so yeah docs should be updated because that does add a lot of value vs. what was documented.

 

P.S. I didn't get an email when you first replied, only yesterday with the latest ones.  thanks for the quick response.

avatar