
Explain plan basic questions

Explorer

I'm new to Impala and have some basic questions about how to read an Explain plan. This is on a cluster with 7 nodes, running Impala 2.1.

 

Example query:

select ... about 130 columns...

FROM raw_prod.wallet_summary_key wsk
          JOIN prod.transaction_detail td ON td.transaction_key = wsk.transaction_key
          AND td.merchant_month_of_year = wsk.merchant_month_of_year
          LEFT JOIN raw_prod.wallet_detail_attributes wda ON wsk.transaction_key = wda.transaction_key
          AND wsk.merchant_month_of_year = wda.merchant_month_of_year

 

Estimated Per-Host Requirements: Memory=6.33GB VCores=3
WARNING: The following tables are missing relevant table and/or column statistics.
raw_prod.wallet_detail_attributes, raw_prod.wallet_summary_key

07:EXCHANGE [UNPARTITIONED]
|
04:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
|  hash predicates: wsk.transaction_key = wda.transaction_key, wsk.merchant_month_of_year = wda.merchant_month_of_year
|
|--06:EXCHANGE [BROADCAST]
|  |
|  02:SCAN HDFS [raw_prod.wallet_detail_attributes wda]
|     partitions=1/1 files=1 size=112.57MB
|
03:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: td.transaction_key = wsk.transaction_key, td.merchant_month_of_year = wsk.merchant_month_of_year
|
|--05:EXCHANGE [BROADCAST]
|  |
|  00:SCAN HDFS [raw_prod.wallet_summary_key wsk]
|     partitions=1/1 files=5 size=11.90MB
|
01:SCAN HDFS [prod.transaction_detail td]
   partitions=412/412 files=1348 size=414.21GB

 

Estimated Per-Host Requirements - is a "host" an Impala daemon? For this query, is the estimated cluster memory requirement 7 x 6.33GB = 44GB? Or is VCores also a factor, making it 7 x 6.33GB x 3 = 133GB?

 

The first step is a scan of the transaction_detail table. It reads the entire 414GB, so why isn't the memory estimate higher? Does scanning not load the table into memory? Or is only a portion of the table scanned at a time?

1 REPLY

Re: Explain plan basic questions

Master Collaborator

The memory requirements are only an estimate and can be off in either direction for various reasons. In this case the number seems too high, if anything.
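One thing worth noting: the explain output warns that raw_prod.wallet_summary_key and raw_prod.wallet_detail_attributes are missing statistics, so the planner is guessing at their sizes. Computing stats on those tables usually tightens the estimates, e.g.:

          COMPUTE STATS raw_prod.wallet_summary_key;
          COMPUTE STATS raw_prod.wallet_detail_attributes;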

 

We definitely don't load the entire dataset into memory: Impala generally processes data in a streaming manner, where we only keep data in memory when it's required for processing. E.g. in your query above, we need to bring the right side of both joins into memory**, but once those are in memory, we can stream data through the pipeline of nodes 01->03->04->07 back to the client.
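If you want to see how memory is actually used, rather than estimated, one option is to run the query from impala-shell and then look at the runtime breakdown (a sketch of standard impala-shell usage on Impala 2.x):

          SUMMARY;   -- per-operator rows and peak memory across hosts
          PROFILE;   -- full runtime profile, including per-host memory usage

Comparing SUMMARY's Peak Mem column against Est. Peak Mem is a quick way to see how far off the planner's guess was.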

 

So in your case we need to fit wallet_detail_attributes and wallet_summary_key in memory. Their compressed size on disk is about 124MB total, so they will be larger uncompressed in memory, but both together should still add up to much less than 6.33GB, unless you're getting really awesome compression.
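As an aside: if a broadcast build side ever did grow too large for a single node's memory, you could steer the planner toward a partitioned join with a hint, which hashes both sides across the cluster instead of copying the right side to every node. A sketch using the Impala 2.x bracket hint syntax (column list elided as in your original query):

          SELECT ...
          FROM prod.transaction_detail td
            JOIN [SHUFFLE] raw_prod.wallet_summary_key wsk
              ON td.transaction_key = wsk.transaction_key
             AND td.merchant_month_of_year = wsk.merchant_month_of_year;

The trade-off is extra network shuffling of the large left side, so broadcast is usually the right choice for small build tables like these.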

 

**Actually, we don't necessarily need to fit the right sides in memory: we can divide up the join inputs, write some subsets out to disk, and process one subset at a time in memory.
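That spill-to-disk path kicks in when a query approaches its memory limit (it was added in Impala 2.0, so it applies to your 2.1 cluster). If you want to cap a query's per-node memory yourself, you can set the limit in your session; the 2g value below is just an arbitrary example:

          SET MEM_LIMIT=2g;
          -- subsequent queries in this session are capped at 2GB per node,
          -- spilling join/sort/aggregation state to disk as needed to stay under the cap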
