Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here. Want to know more about what has changed? Check out the Community News blog.

Impala performance on table with array complex type

SOLVED Go to solution

Impala performance on table with array complex type

New Contributor

I have a table that I'm querying with an array<String> column and queries are extremely slow when accessing the complex type.  The table is partitioned and Impala is doing a broadcast cross join on the "array" component and scanning ALL partitions, wherein the base part of the query, it's pruning the partitions based on the partition key and only scanning the partition I'm querying on.

 

 

Here is the query:

 

select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta, wdl_atomic.retail_transaction_attribute.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;

 

This is the explain plan:

 

Explain String	
Max Per-Host Resource Reservation: Memory=0B	
Per-Host Resource Estimates: Memory=44.61GB	
	
PLAN-ROOT SINK	
|	
04:EXCHANGE [UNPARTITIONED]	
|	
02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]	
|	
|--03:EXCHANGE [BROADCAST]	
|  |	
|  00:SCAN HDFS [retail_transaction_attribute rta]	
|     partitions=1/50 files=40 size=66.49GB	
|	
01:SCAN HDFS [retail_transaction_attribute.retail_offering_material_group_distinct_list bow]	
   partitions=50/50 files=2000 size=763.01GB

Is this expected behavior from Impala?  It seems odd that it wouldn't also just scan the same partitions in both parts of the query.  I also thought that Impala didn't need to perform an additional step to unpack the array into scalar values, which it's doing a CROSS JOIN for and that's generating hundreds of billions of rows to be broadcast.

 

Any insight is appreciated.

Thanks!

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Impala performance on table with array complex type

Master Collaborator

If you want to do the implicit join between the table and the nested collection, you need to reference the nested collection using the alias that you used for the table. Otherwise the top-level table and the nested collection are treated as independent table references and the query means "return the cartesian product of the tables".

 

I.e. you want to rewrite as follows:

 

select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta,
	rta.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;

That will solve your issue.

1 REPLY 1

Re: Impala performance on table with array complex type

Master Collaborator

If you want to do the implicit join between the table and the nested collection, you need to reference the nested collection using the alias that you used for the table. Otherwise the top-level table and the nested collection are treated as independent table references and the query means "return the cartesian product of the tables".

 

I.e. you want to rewrite as follows:

 

select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta,
	rta.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;

That will solve your issue.