I'm running a cluster with 8 worker nodes configured with 160GB Impala Daemon Memory Limit. The worker nodes each have 370GB RAM and based on a look at the standard Host Memory Usage graph from Cloudera Manager for the nodes, it looks like I have capacity for additional query space.
My question: Does it look like I have room to increase my Impala values to meet my needs? From my viewpoint, I think I have at least another 100GB of headroom, but I don't want to impact Hive or Spark processing that may occur during the same time windows.
I'd like to accomplish the following:
Over the past week, the nodes' host memory usage graph contains the following example peaks:
During a quiet time, the numbers look like:
Increasing the Impala memory limit isn't likely to have any negative consequences for Impala. The only potential downside, based on what you posted, is that reducing the memory available to Linux for I/O caching could slow down scans.
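As a rough sanity check on the numbers you posted, here is a minimal Python sketch of the per-node arithmetic. The 370GB and 160GB figures come from your description; the 100GB increase is your own headroom estimate rather than anything measured, and the function name is purely illustrative.

```python
# Back-of-the-envelope headroom check using the figures from the question.
NODE_RAM_GB = 370              # physical RAM per worker node
CURRENT_IMPALA_LIMIT_GB = 160  # current Impala Daemon Memory Limit
PROPOSED_INCREASE_GB = 100     # the asker's estimate of spare headroom (an assumption)
WORKER_NODES = 8


def remaining_for_everything_else(node_ram_gb, impala_limit_gb):
    """RAM left on a node for the OS page cache, Hive, Spark and other services."""
    return node_ram_gb - impala_limit_gb


proposed_limit_gb = CURRENT_IMPALA_LIMIT_GB + PROPOSED_INCREASE_GB

print("Left per node today:      ",
      remaining_for_everything_else(NODE_RAM_GB, CURRENT_IMPALA_LIMIT_GB), "GB")
print("Left per node after bump: ",
      remaining_for_everything_else(NODE_RAM_GB, proposed_limit_gb), "GB")
print("Cluster-wide Impala memory:", WORKER_NODES * proposed_limit_gb, "GB")
```

Whatever is left per node after the increase has to cover the peak Hive and Spark usage you see in the graphs plus the I/O cache, so compare it against your busiest window rather than the quiet one.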
> I'd like to allow some queries that tend to overreach on Impala RAM additional capacity to do what they need to do. These queries read some big tables, sometimes with thousands of partitions, and they have a tendency to run out of RAM.
If you haven't already, it's probably worth trying to tune the queries, since they're still going to be relatively slow even if they stay fully in memory. There's no reason queries over big tables are necessarily memory-intensive; it's usually only certain operations or suboptimal plans.
There's some high-level advice here if you haven't already seen it: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_scalability.html#spill_to_di... . The most common problem I see leading to unnecessary spilling is large right inputs to hash joins. If you can tweak the query to reduce the amount of data on that side of the join, or fix a suboptimal join order/strategy in the plan, you can get dramatic improvements. You probably already checked this, but if you don't have stats on all tables in the query, that's the first thing to try addressing.
> Currently, I don't have any admission control settings enabled. Any query can use all the available resources. I'd like to increase the available RAM for all of Impala while limiting the RAM for individual queries.
You didn't say what version you're using, but CDH6.1+ has some admission control settings that will automatically allocate a variable amount of memory to each query, within a min/max memory limit range that is configurable by the cluster admin. This is probably useful for you.
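To illustrate what a min/max range means in practice, here is a simplified Python sketch. It is not Impala's actual admission control logic: the function name, the idea of feeding it a single per-query estimate, and the 2GB/40GB bounds are all assumptions made for the example.

```python
def effective_query_mem_limit(estimate_gb, min_limit_gb, max_limit_gb):
    """Clamp a per-query memory estimate into an admin-configured [min, max] range.

    Simplified illustration only; real admission control considers more inputs.
    """
    if min_limit_gb > max_limit_gb:
        raise ValueError("min limit must not exceed max limit")
    return max(min_limit_gb, min(estimate_gb, max_limit_gb))


# Hypothetical pool settings chosen for the example, not recommendations.
MIN_LIMIT_GB = 2
MAX_LIMIT_GB = 40

for estimate in (1, 15, 90):
    limit = effective_query_mem_limit(estimate, MIN_LIMIT_GB, MAX_LIMIT_GB)
    print(f"estimate {estimate} GB -> limit {limit} GB")
```

The point of the range is that small queries still get a sensible floor, while a query that overreaches is capped at the maximum instead of taking all of the memory on the node.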