Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Impala is scanning all the Partitions in join query , how to optimize query ?

avatar

How can we do partition pruninng  in impala join query , not able to figure out and where clause is also not giving the desired results , need some help on this .

here is the query :

select a11.apn apn,
cast(a12.tariff_plan as int) tariff_plan,
a11.subscriber_id subscriber_id,
a11.partition_date partition_date,
a13.name name,
a12.tariff_plan tariff_plan0,
a12.charging_node charging_node,
concat(substring(cast(a11.partition_date as string),5,2),'/',substring(cast(a11.partition_date as string),7,2),'/',
substring(cast(a11.partition_date as string),1,4)) CustCol_23,
sum(case when (a11.volume between 10*1024 and 50*1024) then 1 else 0 end) WJXBFS1,
sum(case when (a11.volume between 5*1024 and 10*1024) then 1 else 0 end) WJXBFS2,
sum(case when a11.volume < 5*1024 then 1 else 0 end) WJXBFS3,
sum(case when a11.volume>50*1024 then 1 else 0 end) WJXBFS4
from acc_volume_daily a11
join subscriber_history_daily a12
on (a11.partition_date = a12.partition_date and
a11.subscriber_id = a12.subscriber_id and
concat(substring(cast(a11.partition_date as string),5,2),'/',substring(cast(a11.partition_date as string),7,2),'/',
substring(cast(a11.partition_date as string),1,4)) = concat(substring(cast(a12.partition_date as string),5,2),'/',substring(cast(a12.partition_date as string),7,2),'/',
substring(cast(a12.partition_date as string),1,4)))
join l_data_plans a13
on (cast(a12.tariff_plan as int) = a13.id)
where a11.partition_date = a12.partition_date = 20140411
group by a11.apn,
cast(a12.tariff_plan as int),
a11.subscriber_id,
a11.partition_date,
a13.name,
a12.tariff_plan,
a12.charging_node,
concat(substring(cast(a11.partition_date as string),5,2),'/',substring(cast(a11.partition_date as string),7,2),'/',
substring(cast(a11.partition_date as string),1,4));

--------------------------------------------------------------------------------------------------------------------------------------------------------

here is the explain query result. 

Explain String
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PLAN FRAGMENT 0
PARTITION: UNPARTITIONED

10:EXCHANGE
tuple ids: 3

PLAN FRAGMENT 1
PARTITION: HASH_PARTITIONED: a11.apn, CAST(a12.tariff_plan AS INT), a11.subscriber_id, a11.partition_date, a13.name, a12.tariff_plan, a12.charging_node, concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4))

STREAM DATA SINK
EXCHANGE ID: 10
UNPARTITIONED

9:AGGREGATE
| output: SUM(SUM(CASE WHEN (a11.volume BETWEEN 10 * 1024 AND 50 * 1024) THEN 1 ELSE 0 END)), SUM(SUM(CASE WHEN (a11.volume BETWEEN 5 * 1024 AND 10 * 1024) THEN 1 ELSE 0 END)), SUM(SUM(CASE WHEN a11.volume < 5 * 1024 THEN 1 ELSE 0 END)), SUM(SUM(CASE WHEN a11.volume > 50 * 1024 THEN 1 ELSE 0 END))
| group by: a11.apn, CAST(a12.tariff_plan AS INT), a11.subscriber_id, a11.partition_date, a13.name, a12.tariff_plan, a12.charging_node, concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4))
| tuple ids: 3
|
8:EXCHANGE
tuple ids: 3

PLAN FRAGMENT 2
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 8
HASH_PARTITIONED: a11.apn, CAST(a12.tariff_plan AS INT), a11.subscriber_id, a11.partition_date, a13.name, a12.tariff_plan, a12.charging_node, concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4))

5:AGGREGATE
| output: SUM(CASE WHEN (a11.volume BETWEEN 10 * 1024 AND 50 * 1024) THEN 1 ELSE 0 END), SUM(CASE WHEN (a11.volume BETWEEN 5 * 1024 AND 10 * 1024) THEN 1 ELSE 0 END), SUM(CASE WHEN a11.volume < 5 * 1024 THEN 1 ELSE 0 END), SUM(CASE WHEN a11.volume > 50 * 1024 THEN 1 ELSE 0 END)
| group by: a11.apn, CAST(a12.tariff_plan AS INT), a11.subscriber_id, a11.partition_date, a13.name, a12.tariff_plan, a12.charging_node, concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4))
| tuple ids: 3
|
4:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
| CAST(a12.tariff_plan AS INT) = a13.id
| tuple ids: 0 1 2
|
|----7:EXCHANGE
| tuple ids: 2
|
2:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
| a11.partition_date = a12.partition_date
| a11.subscriber_id = a12.subscriber_id
| other predicates: concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4)) = concat(substring(CAST(a12.partition_date AS STRING), 5, 2), '/', substring(CAST(a12.partition_date AS STRING), 7, 2), '/', substring(CAST(a12.partition_date AS STRING), 1, 4))
| tuple ids: 0 1
|
|----6:EXCHANGE
| tuple ids: 1
|
0:SCAN HDFS
table=reporting_mini.acc_volume_daily #partitions=0 size=0B
tuple ids: 0

PLAN FRAGMENT 3
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 7
UNPARTITIONED

3:SCAN HDFS
table=reporting_mini.l_data_plans #partitions=1 size=21.96KB compact
tuple ids: 2

PLAN FRAGMENT 4
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 6
UNPARTITIONED

1:SCAN HDFS
table=reporting_mini.subscriber_history_daily #partitions=191 size=2.33GB compact
tuple ids: 1

1 ACCEPTED SOLUTION

avatar
Contributor

The problem is with the following predicate in the where clause "where a11.partition_date = a12.partition_date = 20140411". The condition a12.partition_date = 20140411 is not properly pushed to the scan node of a12, hence no partition pruning is happening. You may want to rewrite it as "a11.partition_date = a12.partition_date and a12.partition_date = 20140411". 

 

Dimitris

View solution in original post

2 REPLIES 2

avatar
Contributor

The problem is with the following predicate in the where clause "where a11.partition_date = a12.partition_date = 20140411". The condition a12.partition_date = 20140411 is not properly pushed to the scan node of a12, hence no partition pruning is happening. You may want to rewrite it as "a11.partition_date = a12.partition_date and a12.partition_date = 20140411". 

 

Dimitris

avatar

thanks for your reply , and you are right about partition pruning.