Support Questions

PranayMunshi · ‎04-11-2019

Hi friends,
I am working on optimizing the Impala Query performance, for that purpose I am using query options in JDBC URL.
After running the query I am trying to analyse the Plan, Text Plan, Summary and Profile.
(http://host:25000/query_plan?query_id=fsdfsf432a:997fre34ss000000)

But I don't which thing I should consider and how to interpret? I can see lot of terminologies but not able to understand any of them. I could not come to conclusion for performance. I don't which thing I need to modify after observing the Text Plan or Summary or Profile.

Which is more important Summary or Text Plan or Profile?

Could anyone please tell guide me how to understand/interpret below terminology-

Query -
*******************************************************
select ss_sold_time_sk, time_dim.t_time_sk, ss_hdemo_sk, household_demographics.hd_demo_sk, ss_store_sk, s_store_sk, t_hour, t_minute, household_demographics.hd_dep_count, store.s_store_name from store_sales ,household_demographics ,time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and household_demographics.hd_dep_count = 5 and store.s_store_name = 'ese' limit 10000;
*********************************************************************

Text Plan (Why there are indentation and numbers like 10:EXCHANGE [UNPARTITIONED] then 09:EXCHANGE [BROADCAST] )

----------------
Estimated Per-Host Requirements: Memory=384.15MB VCores=4
WARNING: The following tables are missing relevant table and/or column statistics.
tpcds_bin_partitioned_textfile_40.household_demographics, tpcds_bin_partitioned_textfile_40.time_dim

10:EXCHANGE [UNPARTITIONED]
| limit: 10000
| hosts=2 per-host-mem=unavailable
| tuple-ids=0,3,1,2 row-size=80B cardinality=10000
|
06:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: ss_sold_time_sk = time_dim.t_time_sk
| runtime filters: RF000 <- time_dim.t_time_sk
| limit: 10000
| hosts=2 per-host-mem=148.50KB
| tuple-ids=0,3,1,2 row-size=80B cardinality=10000
|
|--09:EXCHANGE [BROADCAST]
| | hosts=1 per-host-mem=0B
| | tuple-ids=2 row-size=16B cardinality=8640
| |
| 02:SCAN HDFS [tpcds_bin_partitioned_textfile_40.time_dim, RANDOM]
| partitions=1/1 files=1 size=4.79MB
| predicates: time_dim.t_hour = 8, time_dim.t_minute >= 30
| table stats: 86400 rows total
| column stats: unavailable
| hosts=1 per-host-mem=32.00MB
| tuple-ids=2 row-size=16B cardinality=8640
|
05:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: ss_hdemo_sk = household_demographics.hd_demo_sk
| runtime filters: RF001 <- household_demographics.hd_demo_sk
| hosts=2 per-host-mem=9.28KB
| tuple-ids=0,3,1 row-size=64B cardinality=22033737
|
|--08:EXCHANGE [BROADCAST]
| | hosts=1 per-host-mem=0B
| | tuple-ids=1 row-size=12B cardinality=720
| |
| 01:SCAN HDFS [tpcds_bin_partitioned_textfile_40.household_demographics, RANDOM]
| partitions=1/1 files=1 size=141.07KB
| predicates: household_demographics.hd_dep_count = 5
| table stats: 7200 rows total
| column stats: unavailable
| hosts=1 per-host-mem=32.00MB
| tuple-ids=1 row-size=12B cardinality=720
|
04:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: ss_store_sk = s_store_sk
| runtime filters: RF002 <- s_store_sk
| hosts=2 per-host-mem=339B
| tuple-ids=0,3 row-size=52B cardinality=22033737
|
|--07:EXCHANGE [BROADCAST]
| | hosts=1 per-host-mem=0B
| | tuple-ids=3 row-size=28B cardinality=11
| |
| 03:SCAN HDFS [tpcds_bin_partitioned_textfile_40.store, RANDOM]
| partitions=1/1 files=1 size=29.34KB
| predicates: store.s_store_name = 'ese'
| table stats: 112 rows total
| column stats: all
| hosts=1 per-host-mem=32.00MB
| tuple-ids=3 row-size=28B cardinality=11
|
00:SCAN HDFS [tpcds_bin_partitioned_textfile_40.store_sales, RANDOM]
partitions=1824/1824 files=1824 size=13.84GB
runtime filters: RF000 -> ss_sold_time_sk, RF001 -> ss_hdemo_sk, RF002 -> ss_store_sk
table stats: 115203420 rows total
column stats: all
hosts=2 per-host-mem=384.00MB
tuple-ids=0 row-size=24B cardinality=115203420
----------------

******************************************************************

Exec Summary (What to read and how to interpret from below sheet)

Operator          #Hosts  Avg Time  Max Time   #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                         
---------------------------------------------------------------------------------------------------------------------------
10:EXCHANGE            1   1.492ms   1.492ms  10.00K      10.00K          0        -1.00 B  UNPARTITIONED                  
06:HASH JOIN           0   0.000ns   0.000ns       0      10.00K          0      148.50 KB  INNER JOIN, BROADCAST          
|--09:EXCHANGE         0   0.000ns   0.000ns       0       8.64K          0              0  BROADCAST                      
|  02:SCAN HDFS        1  55.911ms  55.911ms   1.80K       8.64K    8.07 MB       32.00 MB  tpcds_bin_partitioned_textf... 
05:HASH JOIN           0   0.000ns   0.000ns       0      22.03M          0        9.28 KB  INNER JOIN, BROADCAST          
|--08:EXCHANGE         0   0.000ns   0.000ns       0         720          0              0  BROADCAST                      
|  01:SCAN HDFS        1  32.422ms  32.422ms     720         720  284.00 KB       32.00 MB  tpcds_bin_partitioned_textf... 
04:HASH JOIN           0   0.000ns   0.000ns       0      22.03M          0       339.00 B  INNER JOIN, BROADCAST          
|--07:EXCHANGE         0   0.000ns   0.000ns       0          11          0              0  BROADCAST                      
|  03:SCAN HDFS        1  50.174ms  50.174ms      15          11  108.00 KB       32.00 MB  tpcds_bin_partitioned_textf... 
00:SCAN HDFS           0   0.000ns   0.000ns       0     115.20M          0      384.00 MB  tpcds_bin_partitioned_textf...

************************************************

Profile is very big output therefore not posting here.

Plese help me understanding above data. So far I am just reading articles but not able to come to concrete conclusion, everything is just imagination.

Tim Armstrong · ‎04-19-2019

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_explain_plan.html#explain_pla... is our high level doc. I would recommend starting with summary to understand where time is spent, then using the profile to drill down into individual nodes. WorkloadXM can help a lot automate the analysis process to understand bottlenecks.

Cloudera Community

Support Questions

How to understand / analyse Impala Query Text Plan, Summary, Profilie and Explain Plan