I execute the following sql in the 6 nodes impala cluster (kudus are also installed in each nodes).
select * from small_table e left join big_table s on e.sessionlogid=s.sessionlogid
the small_table with 200 million rows, and the big_table with 2 billion. Both of them are stored in kudu. Each node with 252GB memory, the mem_limit of impala is 70%.
In the same cluster, I did the following 2 test:
1. set BUFFER_POOL_LIMIT=80%
the WriteIoBytes=0, the free memory is 48GB, it cost about 14min to print out about 60000 rows.
2. set BUFFER_POOL_LIMIT=8G
the WriteIoBytes=43G (the temporary file in scratch_dirs of each node is also 43G), the free memory is 29GB, it cost about 14min to print out about 60000 rows.
From the above testing, I found the memory won't be released after decreased the buffer_pool_limit, and it also occupied more disk space. Is this make sense? If I run the spark application in the same nodes, will the performance spark app be influnced?
My impala version is cdh 5.13.
Additional info: I also found when executing the sql in the different resource pools, between them are influenced deeply (If only execute one sql, it costs 17min, if execute them at the same time, it costs more than 1 hours).
The sum resource for each resource pool are equals to the mem_limit, and each sql using the different node as coordinator. The scene confused me. Should I change another default options?
Edited: after set BUFFER_POOL_LIMIT=8G (WriteIoBytes=80GB) for one sql, the sqls will use more memory than without limit. The detail is as below:
A couple of basic questions before I look more deeply.
* Do the tables you're joining have up-to-date stats? If they don't have stats you may just be getting a bad plan.
* Can you provide text query profiles for the queries? They will be too big to paste here, so I'd recommend uploading them somewhere. E.g. a github gist.
1. The stats is up to date.
2. I uploaded 4 profiles in git
2-1). sql1_only_one_execute_profile_in_default_pool: it means at the same time only execute one query in the default resource pool, it costs 8min, the max(cost) is 17min.
2-2). sql1_execute_with_sql2_profile_in_default_pool: it means at the same time execute two queries, this is in the default resource pool, and the sql2 in the dev resource pool. it costs more 2 hours.
2-3). sql2_execute_with_sql1_profile_in_dev_pool: this is executing with 2-2), and did not limit the buffer pool. it costs more 2 hours.
2-4) sql2_execute_with_sql1_profile_in_dev_pool_buffer_limit_8G: this is executing with 2-2), and limit the buffer pool to 8G, it occupied more memory.
When executed the above sqls, the resource for each resource pool did not change. And when #rows fetched in impala ui more than 1, the sql is treated as finished.
It is my fault. I found the root cause: the reason is the sql2 choose the wrong execute strategy, it try to broadcast the big table, and the network is full. After specified the shuffle hint, both of them executed fast.
I also re-test the BUFFER_POOL_LIMIT, it works. When I set it as 2G, it executed so slowly (After set to 2G, the other sql, which without buffer pool limit, also executing slowly, I'm not sure which resource is the bottleneck, I hope you can give me some advices, too). But I still have the following doubt：
1. How to limit the broadcast memory？
2. Is the broadcast memory for one sql belong to the resource pool, which the sql executing in? For example: resource pool memory: 10G, one sql in it used brodcast memory 9G, how many memory can I use in this pool, 1G or 10G?
I also re-test the BUFFER_POOL_LIMIT, it works. When I set it as 2G, it executed so slowly (After set to 2G, the other sql, which without buffer pool limit, also executing slowly, I'm not sure which resource is the bottleneck, I hope you can give me some advices, too).
For this issue, I think the root cause is: sql1 occupied the memory in cached, and sql2 need to contend for getting the resource. Following is the test steps:
Why the sql1 release the resource immediately? Here is the profile for sql1 and sql2.
Thanks for the update. The broadcast join thing explains a lot - the planner is making a suboptimal decision about broadcast versus partitioned, because it does not factor in the buffer_pool_limit setting by design and therefore not costing the spilling to disk.
Fortunately, I have a suggestion that should avoid the problem. I recommend that you set mem_limit for the query to the maximum amount of memory you want to use. If the hash tables for the join don't fit in the mem_limit setting, the planner will automatically switch to a partitioned join. I recommend not setting buffer_pool_limit at all - it's really an internal option used for testing and for working around specific problems. If you set mem_limit, the buffer pool memory consumption is automatically constrained.
As a best practice, we recommend always setting a query memory limit to ensure that . You can do this automatically using dynamic resource pools - see https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_resource_pools.html#concept_xk... . E.g. if you only have a single "default" pool, you can set the " Default Query Memory Limit" for that pool.
Also, to address another question you had earlier, it's expected that the Impala daemon process memory consumption can increase with
Setting buffer_pool_limit or mem_limit per query does not limit the memory consumption of the Impala daemon process as a whole, because some cached data is not counted against the query's memory consumption. The cached data can be evicted from memory if another query needs it, but if you only have one query running on the system, that won't happen.
The only way to reliably limit the memory consumption of the Impala daemon process is by setting the Impala daemon's memory limit.
When I change the sql1 to partitioned, and executed them at the same time, I found it will cost a lot of time to evict the cached data from memory, here is the test:
Is this caused by the impala internal limit?