Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Avoiding hdfs scan when querying only partition columns

Solved Go to solution

Avoiding hdfs scan when querying only partition columns

Contributor

We have a slow query like:

 

select max(partition_col_1) from some_table where partition_col_2 = 'x'

 

and it's super slow, scanning all records (hundreds of billions) in the filtered partitions, even though it's actually not getting anything out of them... select only includes partitioning column.  Absolutely no need to read any files I don't think.

 

Any way or hint to get around this?

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Avoiding hdfs scan when querying only partition columns

Master Collaborator

Yes! Glad you asked.

 

There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.

 

The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.

 

Here it is in action:

 

[tarmstrong-box2.ca.cloudera.com:21000] default> set  OPTIMIZE_PARTITION_KEY_SCANS = 1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1
[tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0;
Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0
+--------------------------------------------------------+
| Explain String                                         |
+--------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=1 |
| Per-Host Resource Estimates: Memory=10MB               |
| Codegen disabled by planner                            |
|                                                        |
| PLAN-ROOT SINK                                         |
| |                                                      |
| 01:AGGREGATE [FINALIZE]                                |
| |  output: max(ss_sold_date_sk)                        |
| |  row-size=4B cardinality=1                           |
| |                                                      |
| 00:UNION                                               |
|    constant-operands=182                               |
|    row-size=4B cardinality=182                         |
+--------------------------------------------------------+
5 REPLIES 5

Re: Avoiding hdfs scan when querying only partition columns

Master Collaborator

Yes! Glad you asked.

 

There is an optimisation that can be enabled with the OPTIMIZE_PARTITION_KEY_SCANS query option: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_optimize_partition_key_scans..... This converts queries like your example into a metadata-only query.

 

The only reason it isn't enabled by default is because you can get different results if you have a partition with only files with 0 rows in it - the metadata doesn't have enough information to detect this case.

 

Here it is in action:

 

[tarmstrong-box2.ca.cloudera.com:21000] default> set  OPTIMIZE_PARTITION_KEY_SCANS = 1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1
[tarmstrong-box2.ca.cloudera.com:21000] default> explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0;
Query: explain select max(ss_sold_date_sk) from tpcds_parquet.store_sales where ss_sold_date_sk % 10 = 0
+--------------------------------------------------------+
| Explain String                                         |
+--------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=1 |
| Per-Host Resource Estimates: Memory=10MB               |
| Codegen disabled by planner                            |
|                                                        |
| PLAN-ROOT SINK                                         |
| |                                                      |
| 01:AGGREGATE [FINALIZE]                                |
| |  output: max(ss_sold_date_sk)                        |
| |  row-size=4B cardinality=1                           |
| |                                                      |
| 00:UNION                                               |
|    constant-operands=182                               |
|    row-size=4B cardinality=182                         |
+--------------------------------------------------------+

Re: Avoiding hdfs scan when querying only partition columns

Guru
Hi Tim,

The doc says:

This optimization does not apply if the queries contain any WHERE, GROUP BY, or HAVING clause.

Do you know if it still applies if WHERE condition only has partition column as well?

Cheers
Eric

Re: Avoiding hdfs scan when querying only partition columns

Master Collaborator

That example does show that it works in at least one case with a where referencing a partition column. I don't know off the top of my head the exact set of cases where it works, but it does seem like the docs are not totally accurate.

Re: Avoiding hdfs scan when querying only partition columns

Contributor

Thanks very much Tim, I can confirm that it works like a charm, even with the group by, so yeah docs should be updated because that does add a lot of value vs. what was documented.

 

P.S. I didn't get an email when you first replied, only yesterday with the latest ones.  thanks for the quick response.

Highlighted

Re: Avoiding hdfs scan when querying only partition columns

Master Collaborator
Don't have an account?
Coming from Hortonworks? Activate your account here