
How to set MEM_LIMIT based on explain plan


I followed the Cloudera guide to set up and configure Impala admission control via Dynamic Resource Pools in CDH 5.7 (Impala 2.5.0). One of the first points is that the "admin should know what kind of workflow is executed and what the resource requests are".

 

I started with this, capping the queries with a trial-and-error approach. I set a limit of 2GB for a particular query via the SET MEM_LIMIT option and it worked fine.
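For reference, this is roughly how I am capping it in impala-shell (the SELECT here is just a placeholder for the real statement):

SET MEM_LIMIT=2g;                            -- caps each query in this session at 2GB per node
SELECT ... FROM tmp.table8_detail d ...;     -- the query under test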

 

Later the query failed with a memory limit exceeded error; however, the explain summary clearly shows that there is NO peak of more than a few MBs. Everything is under 1GB (except the scan, which has 1.08GB).

 

I had to raise the limit from 2GB to 4GB. Now the query runs ok.

 

What concerns me is that relying on the explain summary is not enough, and I don't know where to look for a safe threshold.

 

Exec Summary
Operator             #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                         
---------------------------------------------------------------------------------------------------------------------------------
37:EXCHANGE               5   74.246ms  367.995ms  810.69K       3.62M          0              0  HASH(CAST(substring(field... 
23:SELECT                 5   15.465ms   23.090ms  810.69K       3.62M    3.01 MB              0                                 
22:ANALYTIC               5  191.768ms  263.507ms  811.66K       3.62M   23.01 MB              0                                 
21:SORT                   5    1s492ms    1s802ms  811.66K       3.62M  112.64 MB      240.00 MB                                 
36:EXCHANGE               5   66.684ms   74.809ms  811.66K       3.62M          0              0  HASH(d.eid)                 
20:HASH JOIN              5  233.074ms  404.226ms  811.66K       3.62M    2.10 MB       331.00 B  LEFT OUTER JOIN, BROADCAST     
|--35:EXCHANGE            5   12.292us   14.792us       71           7          0              0  BROADCAST                      
|  11:SCAN HDFS           1    6.251ms    6.251ms       71           7   84.00 KB       64.00 MB  trg.table1 tp         
19:HASH JOIN              5   45.483ms   57.714ms  811.66K       3.62M    2.10 MB       103.00 B  LEFT OUTER JOIN, BROADCAST     
|--34:EXCHANGE            5   15.827us   17.612us      441           2          0              0  BROADCAST                      
|  10:SCAN HDFS           1   23.226ms   23.226ms      441           2  115.00 KB       80.00 MB  trg.table2 o        
18:HASH JOIN              5   81.817ms  103.238ms  811.66K       3.62M    2.27 MB        2.53 KB  LEFT OUTER JOIN, BROADCAST     
|--33:EXCHANGE            5   10.905us   16.528us      116          78          0              0  BROADCAST                      
|  09:SCAN HDFS           1   21.119ms   21.119ms      116          78   77.00 KB       48.00 MB  trg.table3... 
17:HASH JOIN              5  238.710ms  310.216ms  810.69K       3.62M    3.09 MB       34.79 KB  LEFT OUTER JOIN, BROADCAST     
|--32:EXCHANGE            5   14.653us   17.272us      506         506          0              0  BROADCAST                      
|  08:SCAN HDFS           1    5.079ms    5.079ms      506         506  121.17 KB       48.00 MB  trg.table4 s1   
16:HASH JOIN              5   37.020ms   48.938ms  810.69K       3.62M    2.06 MB      154.41 KB  LEFT OUTER JOIN, BROADCAST     
|--31:EXCHANGE            5    7.393us   12.096us        4       3.78K          0              0  BROADCAST                      
|  07:SCAN HDFS           1    6.101ms    6.101ms        4       3.78K  958.15 KB       64.00 MB  tmp.table8 s2                
15:HASH JOIN              5   21.639ms   29.371ms  810.69K       3.62M    2.05 MB      154.41 KB  LEFT OUTER JOIN, BROADCAST     
|--30:EXCHANGE            5    8.122us   11.664us        1       3.78K          0              0  BROADCAST                      
|  06:SCAN HDFS           1    8.067ms    8.067ms        1       3.78K  958.15 KB       64.00 MB  tmp.table8 s3                
14:HASH JOIN              5   82.441ms   99.157ms  810.69K       3.62M    2.53 MB      154.41 KB  LEFT OUTER JOIN, BROADCAST     
|--29:EXCHANGE            5  220.902us  235.376us   11.35K       3.78K          0              0  BROADCAST                      
|  05:SCAN HDFS           1    5.741ms    5.741ms   11.35K       3.78K    1.09 MB       64.00 MB  tmp.table8 s4                
13:HASH JOIN              5  254.708ms  304.909ms  810.69K       3.62M  152.05 MB        8.64 MB  RIGHT OUTER JOIN, PARTITIONED  
|--28:EXCHANGE            5   40.477ms   58.477ms  810.69K      82.11K          0              0  HASH(CAST(d.entid AS INT))   
|  12:HASH JOIN           7  226.257ms  359.085ms  810.69K      82.11K    2.09 MB       184.00 B  LEFT OUTER JOIN, BROADCAST     
|  |--26:EXCHANGE         7    9.622us   11.480us        9           9          0              0  BROADCAST                      
|  |  03:SCAN HDFS        1    4.366ms    4.366ms        9           9   47.00 KB       32.00 MB  trg.table5 ro        
|  02:HASH JOIN           7   23.812ms   35.389ms  810.69K      82.11K    2.20 MB        2.16 MB  LEFT OUTER JOIN, PARTITIONED   
|  |--25:EXCHANGE         7   91.400us  131.127us   10.44K      82.11K          0              0  HASH(d_sur.reference,d_sur.... 
|  |  01:SCAN HDFS        7  410.599ms  520.945ms   10.44K      82.11K   13.21 MB      168.00 MB  tmp.table8_detail d_sur      
|  24:EXCHANGE            7   14.274ms   20.817ms  810.69K      82.11K          0              0  HASH(d.reference,d.mnid,d... 
|  00:SCAN HDFS           7  495.925ms  697.044ms  810.69K      82.11K   52.63 MB        1.08 GB  tmp.table8_detail d          
27:EXCHANGE               5   26.006ms   34.625ms    3.62M       3.62M          0              0  HASH(c.intid)         
04:SCAN HDFS              5  239.460ms  369.856ms    3.62M       3.62M   35.71 MB       48.00 MB  tmp.table7 c

 

Based on this explain summary some would say 1.08GB is the peak, so a 1.5GB limit would be safe. But it is not even close.

Any thoughts on where to look for the real consumption?

Thanks 


4 REPLIES


And even more confusing is this:

 

The limit is way bigger than the consumption, but it still fails:

 

Memory Limit Exceeded
Query(a04dc11b92580986:99f26ad466311b9f) Limit: Limit=4.00 GB Consumption=1.48 GB
  Fragment a04dc11b92580986:99f26ad466311baa: Consumption=48.03 MB
    SELECT_NODE (id=29): Consumption=0
    ANALYTIC_EVAL_NODE (id=28): Consumption=0
    SORT_NODE (id=27): Consumption=48.02 MB
    EXCHANGE_NODE (id=44): Consumption=0
    DataStreamRecvr: Consumption=0
  Block Manager: Limit=3.20 GB Consumption=528.00 MB
  Fragment a04dc11b92580986:99f26ad466311bb0: Consumption=988.46 MB
    HASH_JOIN_NODE (id=26): Consumption=0
    HASH_JOIN_NODE (id=25): Consumption=0
    HASH_JOIN_NODE (id=24): Consumption=0
    NESTED_LOOP_JOIN_NODE (id=23): Consumption=494.08 MB
      Exprs: Consumption=494.00 MB
    HASH_JOIN_NODE (id=22): Consumption=72.00 KB
    HASH_JOIN_NODE (id=21): Consumption=64.00 KB
    NESTED_LOOP_JOIN_NODE (id=20): Consumption=494.06 MB
      Exprs: Consumption=494.00 MB
    HASH_JOIN_NODE (id=19): Consumption=76.00 KB
      Exprs: Consumption=28.00 KB
    HASH_JOIN_NODE (id=18): Consumption=40.00 KB
    HASH_JOIN_NODE (id=17): Consumption=32.00 KB
    NESTED_LOOP_JOIN_NODE (id=16): Consumption=32.00 KB
    HASH_JOIN_NODE (id=15): Consumption=16.00 KB
    EXCHANGE_NODE (id=31): Consumption=0
    DataStreamRecvr: Consumption=0
    EXCHANGE_NODE (id=32): Consumption=0
    DataStreamRecvr: Consumption=0
    EXCHANGE_NODE (id=33): Consumption=0

Champion

Can you link the guide you are following?  Admission Control and dynamic resource pools are separate but can function together.  Unless I am mistaken, MEM_LIMIT is the memory limit for each Impala daemon.  OK, I just did a quick read: it can be used at runtime per query, and it will set the limit for a specific query.  Is this where you are setting it?  This has nothing to do with Admission Control or DRP.

 

For DRP you can set default_pool_mem_limit to cap how much memory can be used in a pool by specific users/groups/queries.
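If I read the docs right, the two levels look roughly like this (pool name and values are just examples):

SET REQUEST_POOL=my_pool;   -- example pool name; routes this session's queries to that pool
SET MEM_LIMIT=4g;           -- per-query/session limit, overriding any default the pool sets

-- pool level: default_pool_mem_limit (an impalad startup flag) caps the total memory
-- that all running queries in the pool may use together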

 

 


Yes, I am setting the memory limit per query. My point was that even when I set a much higher limit via SET MEM_LIMIT, the query fails.
Example:
I set 3GB. The query fails, and the error message shows only 1GB of consumption. I assume the metrics about memory consumption are not accurate, so OK, I accept that the query ate more than 3 gigs.
But then I raise the limit to 5GB and the query succeeds. So I check the real execution plan, and there is no evidence of it using more than 3GB; let's say there is a peak memory usage of 1.5GB. I don't have the real exec plans or profiles here, but believe me, the output is not correct.

Regarding the admission control, this is the link:
https://blog.cloudera.com/blog/2016/12/resource-management-for-apache-impala-incubating/

"Cloudera recommends that you set memory limits for queries whenever possible. Default memory limits can be set on all queries in a pool (see below), and explicitly specifying the query memory limit will override the default. If you are using admission control pools with restrictions on the maximum memory available to queries in a pool, setting default per-query memory limits is required."

Another piece of Cloudera documentation states clearly that if you DO NOT SET the mem limit, then the default query limit is "allocated". So if I have 100GB of total memory, 5 nodes, and I set 10GB as a default, then if two users run a query at the same time (even tiny ones), a third query will go to the queue and wait:

"A dynamic resource pool has Max Memory set to 100 GB.
The Default Query Memory Limit for the pool is 10 GB. Therefore, any query running in this pool could use up to 50 GB of memory (default query memory limit * number of Impala nodes).
The maximum number of queries that Impala executes concurrently within this dynamic resource pool is two, which is the most that could be accommodated within the 100 GB Max Memory cluster-wide limit."
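Spelling out the arithmetic in that example:

per-query memory   = default query memory limit * number of nodes = 10 GB * 5      = 50 GB
concurrent queries = Max Memory / per-query memory                = 100 GB / 50 GB = 2

So with a 10 GB default and 5 nodes, only two queries get admitted at a time and the third one queues, no matter how small it really is.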

So what I would like to do is measure EXACTLY the peak memory and SET the limit to a value that is safe enough to let the query go through.
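Concretely, something like this is what I am after, assuming the runtime profile is the right place to read it from:

-- in impala-shell, right after the query finishes (or fails):
PROFILE;
-- then look for "Per Node Peak Memory Usage" and the per-operator
-- PeakMemoryUsage counters in the fragment sections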

The problem is that the explain plans and error messages are not correct, or I am not understanding the whole memory limit concept at all.

T.


Thomas, you have a legitimate request and concern.

 

First, there is no perfectly fool-proof solution, because the resource consumption is somewhat dependent on what happens at runtime, and not all memory consumption is tracked by Impala (but most is). We are constantly making improvements in this area though.

 

1. I'd recommend fixing the num_scanner_threads for your queries. A different number of scanner threads can result in different memory consumption from run to run (depending on what else is going on in the system at the time).

2. The operators of a query do not run one-by-one. Some of them run concurrently (e.g. join builds may execute concurrently). So just looking at the highest peak in the exec summary is not enough. Taking the sum of the peaks over all operators is a safer bet, but tends to overestimate the actual consumption.
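As a rough illustration of both points in impala-shell (the values are only placeholders, not a recommendation for this particular workload):

SET NUM_SCANNER_THREADS=2;   -- pin the scanner thread count so scans use similar memory run to run
SET MEM_LIMIT=4g;            -- size this from the sum of per-operator peaks plus some headroom,
                             -- not from the single largest peak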

 

Hope this helps!