Created on 02-06-2019 05:22 PM - edited 09-16-2022 07:08 AM
I have a table that I'm querying with an array<String> column and queries are extremely slow when accessing the complex type. The table is partitioned and Impala is doing a broadcast cross join on the "array" component and scanning ALL partitions, wherein the base part of the query, it's pruning the partitions based on the partition key and only scanning the partition I'm querying on.
Here is the query:
select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item from wdl_atomic.retail_transaction_attribute rta, wdl_atomic.retail_transaction_attribute.retail_offering_material_group_distinct_list bow where rta.fiscal_period_id = 2019001;
This is the explain plan:
Explain String Max Per-Host Resource Reservation: Memory=0B Per-Host Resource Estimates: Memory=44.61GB PLAN-ROOT SINK | 04:EXCHANGE [UNPARTITIONED] | 02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST] | |--03:EXCHANGE [BROADCAST] | | | 00:SCAN HDFS [retail_transaction_attribute rta] | partitions=1/50 files=40 size=66.49GB | 01:SCAN HDFS [retail_transaction_attribute.retail_offering_material_group_distinct_list bow] partitions=50/50 files=2000 size=763.01GB
Is this expected behavior from Impala? It seems odd that it wouldn't also just scan the same partitions in both parts of the query. I also thought that Impala didn't need to perform an additional step to unpack the array into scalar values, which it's doing a CROSS JOIN for and that's generating hundreds of billions of rows to be broadcast.
Any insight is appreciated.
Thanks!
Created 02-07-2019 09:41 AM
If you want to do the implicit join between the table and the nested collection, you need to reference the nested collection using the alias that you used for the table. Otherwise the top-level table and the nested collection are treated as independent table references and the query means "return the cartesian product of the tables".
I.e. you want to rewrite as follows:
select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item from wdl_atomic.retail_transaction_attribute rta, rta.retail_offering_material_group_distinct_list bow where rta.fiscal_period_id = 2019001;
That will solve your issue.
Created 02-07-2019 09:41 AM
If you want to do the implicit join between the table and the nested collection, you need to reference the nested collection using the alias that you used for the table. Otherwise the top-level table and the nested collection are treated as independent table references and the query means "return the cartesian product of the tables".
I.e. you want to rewrite as follows:
select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item from wdl_atomic.retail_transaction_attribute rta, rta.retail_offering_material_group_distinct_list bow where rta.fiscal_period_id = 2019001;
That will solve your issue.