Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Impala is scanning all the Partitions in join query , how to optimize query ?

avatar

How can we do partition pruninng  in impala join query , not able to figure out and where clause is also not giving the desired results , need some help on this .

here is the query :

select a11.apn apn,
cast(a12.tariff_plan as int) tariff_plan,
a11.subscriber_id subscriber_id,
a11.partition_date partition_date,
a13.name name,
a12.tariff_plan tariff_plan0,
a12.charging_node charging_node,
concat(substring(cast(a11.partition_date as string),5,2),'/',substring(cast(a11.partition_date as string),7,2),'/',
substring(cast(a11.partition_date as string),1,4)) CustCol_23,
sum(case when (a11.volume between 10*1024 and 50*1024) then 1 else 0 end) WJXBFS1,
sum(case when (a11.volume between 5*1024 and 10*1024) then 1 else 0 end) WJXBFS2,
sum(case when a11.volume < 5*1024 then 1 else 0 end) WJXBFS3,
sum(case when a11.volume>50*1024 then 1 else 0 end) WJXBFS4
from acc_volume_daily a11
join subscriber_history_daily a12
on (a11.partition_date = a12.partition_date and
a11.subscriber_id = a12.subscriber_id and
concat(substring(cast(a11.partition_date as string),5,2),'/',substring(cast(a11.partition_date as string),7,2),'/',
substring(cast(a11.partition_date as string),1,4)) = concat(substring(cast(a12.partition_date as string),5,2),'/',substring(cast(a12.partition_date as string),7,2),'/',
substring(cast(a12.partition_date as string),1,4)))
join l_data_plans a13
on (cast(a12.tariff_plan as int) = a13.id)
where a11.partition_date = a12.partition_date = 20140411
group by a11.apn,
cast(a12.tariff_plan as int),
a11.subscriber_id,
a11.partition_date,
a13.name,
a12.tariff_plan,
a12.charging_node,
concat(substring(cast(a11.partition_date as string),5,2),'/',substring(cast(a11.partition_date as string),7,2),'/',
substring(cast(a11.partition_date as string),1,4));

--------------------------------------------------------------------------------------------------------------------------------------------------------

here is the explain query result. 

Explain String
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PLAN FRAGMENT 0
PARTITION: UNPARTITIONED

10:EXCHANGE
tuple ids: 3

PLAN FRAGMENT 1
PARTITION: HASH_PARTITIONED: a11.apn, CAST(a12.tariff_plan AS INT), a11.subscriber_id, a11.partition_date, a13.name, a12.tariff_plan, a12.charging_node, concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4))

STREAM DATA SINK
EXCHANGE ID: 10
UNPARTITIONED

9:AGGREGATE
| output: SUM(SUM(CASE WHEN (a11.volume BETWEEN 10 * 1024 AND 50 * 1024) THEN 1 ELSE 0 END)), SUM(SUM(CASE WHEN (a11.volume BETWEEN 5 * 1024 AND 10 * 1024) THEN 1 ELSE 0 END)), SUM(SUM(CASE WHEN a11.volume < 5 * 1024 THEN 1 ELSE 0 END)), SUM(SUM(CASE WHEN a11.volume > 50 * 1024 THEN 1 ELSE 0 END))
| group by: a11.apn, CAST(a12.tariff_plan AS INT), a11.subscriber_id, a11.partition_date, a13.name, a12.tariff_plan, a12.charging_node, concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4))
| tuple ids: 3
|
8:EXCHANGE
tuple ids: 3

PLAN FRAGMENT 2
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 8
HASH_PARTITIONED: a11.apn, CAST(a12.tariff_plan AS INT), a11.subscriber_id, a11.partition_date, a13.name, a12.tariff_plan, a12.charging_node, concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4))

5:AGGREGATE
| output: SUM(CASE WHEN (a11.volume BETWEEN 10 * 1024 AND 50 * 1024) THEN 1 ELSE 0 END), SUM(CASE WHEN (a11.volume BETWEEN 5 * 1024 AND 10 * 1024) THEN 1 ELSE 0 END), SUM(CASE WHEN a11.volume < 5 * 1024 THEN 1 ELSE 0 END), SUM(CASE WHEN a11.volume > 50 * 1024 THEN 1 ELSE 0 END)
| group by: a11.apn, CAST(a12.tariff_plan AS INT), a11.subscriber_id, a11.partition_date, a13.name, a12.tariff_plan, a12.charging_node, concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4))
| tuple ids: 3
|
4:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
| CAST(a12.tariff_plan AS INT) = a13.id
| tuple ids: 0 1 2
|
|----7:EXCHANGE
| tuple ids: 2
|
2:HASH JOIN
| join op: INNER JOIN (BROADCAST)
| hash predicates:
| a11.partition_date = a12.partition_date
| a11.subscriber_id = a12.subscriber_id
| other predicates: concat(substring(CAST(a11.partition_date AS STRING), 5, 2), '/', substring(CAST(a11.partition_date AS STRING), 7, 2), '/', substring(CAST(a11.partition_date AS STRING), 1, 4)) = concat(substring(CAST(a12.partition_date AS STRING), 5, 2), '/', substring(CAST(a12.partition_date AS STRING), 7, 2), '/', substring(CAST(a12.partition_date AS STRING), 1, 4))
| tuple ids: 0 1
|
|----6:EXCHANGE
| tuple ids: 1
|
0:SCAN HDFS
table=reporting_mini.acc_volume_daily #partitions=0 size=0B
tuple ids: 0

PLAN FRAGMENT 3
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 7
UNPARTITIONED

3:SCAN HDFS
table=reporting_mini.l_data_plans #partitions=1 size=21.96KB compact
tuple ids: 2

PLAN FRAGMENT 4
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 6
UNPARTITIONED

1:SCAN HDFS
table=reporting_mini.subscriber_history_daily #partitions=191 size=2.33GB compact
tuple ids: 1

1 ACCEPTED SOLUTION

avatar
Contributor

The problem is with the following predicate in the where clause "where a11.partition_date = a12.partition_date = 20140411". The condition a12.partition_date = 20140411 is not properly pushed to the scan node of a12, hence no partition pruning is happening. You may want to rewrite it as "a11.partition_date = a12.partition_date and a12.partition_date = 20140411". 

 

Dimitris

View solution in original post

2 REPLIES 2

avatar
Contributor

The problem is with the following predicate in the where clause "where a11.partition_date = a12.partition_date = 20140411". The condition a12.partition_date = 20140411 is not properly pushed to the scan node of a12, hence no partition pruning is happening. You may want to rewrite it as "a11.partition_date = a12.partition_date and a12.partition_date = 20140411". 

 

Dimitris

avatar

thanks for your reply , and you are right about partition pruning.