I'm new to Impala and have some basic questions about how to read an Explain plan. This is on a cluster with 7 nodes, running Impala 2.1.
select ... about 130 columns...
FROM raw_prod.wallet_summary_key wsk
JOIN prod.transaction_detail td ON td.transaction_key = wsk.transaction_key
AND td.merchant_month_of_year = wsk.merchant_month_of_year
LEFT JOIN raw_prod.wallet_detail_attributes wda ON wsk.transaction_key = wda.transaction_key
AND wsk.merchant_month_of_year = wda.merchant_month_of_year
Estimated Per-Host Requirements: Memory=6.33GB VCores=3
WARNING: The following tables are missing relevant table and/or column statistics.
raw_prod.wallet_detail_attributes, raw_prod.wallet_summary_key

07:EXCHANGE [UNPARTITIONED]
|
04:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
|  hash predicates: wsk.transaction_key = wda.transaction_key, wsk.merchant_month_of_year = wda.merchant_month_of_year
|
|--06:EXCHANGE [BROADCAST]
|  |
|  02:SCAN HDFS [raw_prod.wallet_detail_attributes wda]
|     partitions=1/1 files=1 size=112.57MB
|
03:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: td.transaction_key = wsk.transaction_key, td.merchant_month_of_year = wsk.merchant_month_of_year
|
|--05:EXCHANGE [BROADCAST]
|  |
|  00:SCAN HDFS [raw_prod.wallet_summary_key wsk]
|     partitions=1/1 files=5 size=11.90MB
|
01:SCAN HDFS [prod.transaction_detail td]
   partitions=412/412 files=1348 size=414.21GB
Estimated Per-Host Requirements - is a "Host" an Impala daemon? For this query, is the estimated cluster memory requirement 7 x 6.33GB = 44GB? Or is VCores also a factor: 7 x 6.33GB x 3 = 133GB?
The first step is a scan on table transaction_detail. It's reading the entire thing at 414GB, so why isn't the memory requirement higher? Does the act of scanning not load it into memory? Or is only a portion of the table scanned at a time?
The memory requirements are only an estimate and can be off in either direction for various reasons. In this case the estimate seems, if anything, too high.
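One likely cause is the warning at the top of the plan: without table and column statistics the planner has to guess at cardinalities and memory needs. Computing stats on the two flagged tables should tighten the estimates, e.g.:

-- The plan flags these two tables as missing statistics.
COMPUTE STATS raw_prod.wallet_summary_key;
COMPUTE STATS raw_prod.wallet_detail_attributes;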
We definitely don't load the entire dataset into memory: Impala generally processes data in a streaming manner, keeping data in memory only while it's required for processing. E.g. in your query we need to bring the right side of both joins into memory**, but once those are built we can stream data through the pipeline of nodes 01->03->04->07 back to the client.
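As a toy model of that pipeline (not Impala's actual implementation, and the table contents here are invented for illustration): only the small broadcast (build) side of a hash join is materialized in a hash table, while the big probe side streams through one row at a time.

```python
# Toy model of streaming execution for a broadcast hash join.
# Only the small build side (like node 00's output) is held in memory;
# the big probe side (like node 01) streams through lazily.

def scan(rows):
    # Stand-in for an HDFS scan node: yields rows lazily.
    for row in rows:
        yield row

def hash_join(probe, build_rows, key):
    # Build phase: materialize the broadcast (right) side in a hash table.
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    # Probe phase: stream the left side through, emitting matches.
    for row in probe:
        for match in table.get(row[key], []):
            yield {**row, **match}

# Hypothetical miniature data standing in for the real tables.
transaction_detail = [{"transaction_key": k, "amount": k * 10} for k in range(5)]
wallet_summary_key = [{"transaction_key": 2, "wallet": "a"},
                     {"transaction_key": 4, "wallet": "b"}]

results = list(hash_join(scan(transaction_detail), wallet_summary_key,
                         "transaction_key"))
```

Note that memory usage here is proportional to the build side only; the probe side is never fully materialized, which is why scanning 414GB doesn't require 414GB of memory.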
So in your case we need to fit wallet_detail_attributes and wallet_summary_key in memory. Their combined compressed size on disk is about 124MB, so they will be larger uncompressed in memory, but both together should still come in well under 6.33GB, unless you're getting really awesome compression.
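A quick back-of-envelope check (the compression ratios below are assumed, not measured): even at a generous 10x expansion, the two broadcast tables stay far below the 6.33GB per-host estimate.

```python
# Sizes taken from the plan; ratios are hypothetical.
on_disk_mb = 112.57 + 11.90          # wda + wsk compressed on disk
for ratio in (3, 5, 10):             # plausible uncompressed/compressed ratios
    in_memory_gb = on_disk_mb * ratio / 1024
    print(f"{ratio}x -> {in_memory_gb:.2f} GB")
```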
**Actually, we don't necessarily need to fit the right sides in memory: we can divide up the join inputs, write some subsets out to disk, and process one subset at a time in memory.
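That idea can be sketched as a toy spillable hash join in the style of a Grace hash join: both inputs are hash-partitioned to disk, then each partition pair is joined with only that partition's build rows in memory. The file layout and function names here are invented for illustration; Impala's real implementation differs.

```python
import json, os, tempfile

def partition_to_disk(rows, key, n, dirpath, side):
    # Hash-partition rows into n files so matching keys land in the
    # same partition index on both sides.
    paths = [os.path.join(dirpath, f"{side}_{i}.jsonl") for i in range(n)]
    files = [open(p, "w") for p in paths]
    for row in rows:
        files[hash(row[key]) % n].write(json.dumps(row) + "\n")
    for f in files:
        f.close()
    return paths

def spilling_hash_join(left, right, key, n=4):
    out = []
    with tempfile.TemporaryDirectory() as d:
        lpaths = partition_to_disk(left, key, n, d, "l")
        rpaths = partition_to_disk(right, key, n, d, "r")
        for lp, rp in zip(lpaths, rpaths):
            # Only one build partition's hash table is in memory at a time.
            table = {}
            with open(rp) as f:
                for line in f:
                    row = json.loads(line)
                    table.setdefault(row[key], []).append(row)
            with open(lp) as f:
                for line in f:
                    row = json.loads(line)
                    for match in table.get(row[key], []):
                        out.append({**row, **match})
    return out
```

Peak memory is then bounded by the largest single build partition rather than the whole build side, at the cost of extra disk I/O.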