Reply
New Contributor
Posts: 3
Registered: ‎08-14-2018
Accepted Solution

Impala performance on table with array complex type

[ Edited ]

I have a table that I'm querying with an array<String> column and queries are extremely slow when accessing the complex type.  The table is partitioned and Impala is doing a broadcast cross join on the "array" component and scanning ALL partitions, wherein the base part of the query, it's pruning the partitions based on the partition key and only scanning the partition I'm querying on.

 

 

Here is the query:

 

select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta, wdl_atomic.retail_transaction_attribute.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;

 

This is the explain plan:

 

Explain String	
Max Per-Host Resource Reservation: Memory=0B	
Per-Host Resource Estimates: Memory=44.61GB	
	
PLAN-ROOT SINK	
|	
04:EXCHANGE [UNPARTITIONED]	
|	
02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]	
|	
|--03:EXCHANGE [BROADCAST]	
|  |	
|  00:SCAN HDFS [retail_transaction_attribute rta]	
|     partitions=1/50 files=40 size=66.49GB	
|	
01:SCAN HDFS [retail_transaction_attribute.retail_offering_material_group_distinct_list bow]	
   partitions=50/50 files=2000 size=763.01GB

Is this expected behavior from Impala?  It seems odd that it wouldn't also just scan the same partitions in both parts of the query.  I also thought that Impala didn't need to perform an additional step to unpack the array into scalar values, which it's doing a CROSS JOIN for and that's generating hundreds of billions of rows to be broadcast.

 

Any insight is appreciated.

Thanks!

 

 

 

Highlighted
Cloudera Employee
Posts: 422
Registered: ‎07-29-2015

Re: Impala performance on table with array complex type

If you want to do the implicit join between the table and the nested collection, you need to reference the nested collection using the alias that you used for the table. Otherwise the top-level table and the nested collection are treated as independent table references and the query means "return the cartesian product of the tables".

 

I.e. you want to rewrite as follows:

 

select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta,
	rta.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;

That will solve your issue.