Created on 11-29-2017 09:05 PM - edited 08-17-2019 09:58 AM
I'd like to thank @Jean-Philippe Player, @bpreachuk, @ghagleitner, @gopal, @ndembla and @Prasanth Jayachandran for providing input and content for this article.
This document describes LLAP setup for reasonable performance with a typical workload. It is intended as a starting point, not as the definitive answer to all tuning questions. However, recommendations in this document have been applied both to our own instances as well as instances we help other teams to run.
Each step below will be accompanied by an example based on allocating ~50% of a cluster to LLAP, out of 12 nodes with 16 cores and 96Gb dedicated to YARN NodeManager each.
This is determined by the workload (amount of data, concurrency, types of queries, desired latencies); typical values can be in 15-50% range on an existing cluster, or it is possible to give entire cluster to LLAP, depending on customer use case.
LLAP runs 3 types of YARN containers in the cluster:
a) Execution daemons which process the data
b) query coordinators (Tez AMs) orchestrating the execution of the queriues
and c) a single Slider AM starting and monitoring the daemons.
The bulk of the capacity allotted for LLAP will be used by the Execution daemons. For best results, whole YARN nodes should be allocated for these daemons; therefore, you should pick the number of whole nodes to use based on the fraction of the cluster to be dedicated to LLAP.
Example: we want to use 50% of a 12-node the cluster, so we will have 6 LLAP daemons.
This will depend heavily on the application; one thing to be aware of is that with BI tools, the number of running queries at any moment in time is typically much smaller than the number of sessions -users don’t run queries all the time because they take time to look at the results and some tools avoid queries using caching.
The number of queries determines the number of query coordinators (Tez AMs) - one per query + some constant slack in case sessions are temporarily unavailable due to maintenance or cluster issues.
Example: we have determined we need to run 9 queries in parallel; use 10 AMs.
As described above, we will give whole NM nodes to LLAP daemons; however, we also need to leave some space on the cluster for query coordinators and the slider AM.
There are two options. First, it is possible to run these on non-LLAP nodes. In this case, each LLAP daemon can take the entire memory of a NM (see YARN config tab in Ambari, as "Memory allocated for all YARN containers on a node". 😞
If that is acceptable, it's the optimal approach (it maximizes daemon memory and avoids noisy neighbors). However,
The alternative approach is to run AMs on the nodes with LLAP daemons. In this case, LLAP daemon container size needs to be adjusted down to leave space:
There are (N_Tez_AMs + 1)/N_LLAP_nodes extra processes on each node; we recommend 2Gb per AM process. So, for this case, LLAP daemon will use (NM_memory – 2Gb * ceiling((N_Tez_AMs + 1)/N_LLAP_nodes)) .
Example: We will use the 2nd approach; we need ceiling((10+1)/6) = 2 AM processes per LLAP node, reserving 2 * 2Gb per node. Therefore, LLAP daemon size will be 96Gb – 4Gb = 92Gb.
YARN settings for LLAP, as well as the LLAP queue are usually set up by Ambari; if setting up manually or encountering problems, there are a few things to ensure:
When setting up a custom LLAP queue:
There are three numbers that determine the memory configuration for LLAP daemons.
Total memory = off-heap cache memory + Xmx (used by executors and daemon-specific overhead) + JVM overhead . The Xmx overhead (shared Java objects and internal structures) is accounted for by LLAP internally, so it can be ignored in the calculations.
In most cases, you can just follow the standard steps below ; however, sometimes the process needs to be tweaked. The following conflicting criteria apply:
So, number of executors, cache size, and then memory per executor are balance of these 3:
Usually the same as the number of cores.
Example: there are 16 cores, so we will use 16 executors.
Daemon uses almost all of its heap memory (Xmx) for processing query work; the shared overheads are accounted for by LLAP. Thus, we determine the daemon Xmx based on executor memory; you should aim to give a single executor around 4Gb. So, Xmx = n_executors * 4Gb. LLAP executors use memory for things like group by tables, sort buffers, shuffle buffers, map join hashtables, PTF, etc.; generally, the more memory, the less likely they are to spill, or to go with a less efficient plan, e.g. switching from a map to shuffle join - for some background, see https://community.hortonworks.com/articles/149998/map-join-memory-sizing-for-llap.html
If there isn't enough memory on the machine, by default, adjust the number of executors down. It is possible to give executors less memory (e.g. 3Gb) to keep their number higher, however, that must always be backed by testing with the target workload. Insufficient memory can cause much bigger perf problems than reduced task parallelism.
Additionally, if IO layer is enabled, each executor would need about ~200Mb of cache (cache buffers are also used as IO buffer pool) for reasonable operation. The cache size will be covered in the next section.
Example: Xmx = 16
executors * 4 Gb = 64 Gb
Additionally, based on 16 executors, we know that it's good to have (0.2 Gb *
16) = 3.2 Gb cache if we are going to enable it.
We need to leave some memory in the daemon for Java VM overhead (GC structures, etc.) as well as to give the container a little safety margin to prevent it being killed by YARN. This headroom should be about 6% of Xmx, but no more than 6 Gb . This is not set directly and is instead used to calculate the container memory remaining for the use by off-heap cache.
Take out Xmx and the headroom from the total container size to determine cache size. If the cache size is below the viable size above, reduce Xmx to allow for more memory here (typically by reducing the number of executors) or disable LLAP IO (hive.llap.io.enabled = false).
Example:
Xmx is 64 Gb, so headroom = 0.06 * 64 Gb = ~4 Gb (which is lower than 6 Gb).
Cache size = 92 Gb total – 64 Gb Xmx – 4 Gb headroom = 24 Gb. This is above the minimum of 3.2 Gb.
You can specify the above settings on the main interactive query panel, as well as in advanced config (or validate the values predetermined by Ambari).
Some other settings from the advanced config that may be of interest:
Created on 12-06-2017 07:04 PM
Outstanding article!
Created on 12-07-2017 07:33 PM
Thank you 🙂
Created on 12-27-2017 04:39 PM
Thanks a lot for this article. This is exactly what I needed !
Created on 05-04-2018 01:30 PM
where do we set 4gb executor memory and is there a parameter to set the executor memory.
Created on 05-28-2018 04:02 PM
Created on 05-29-2018 08:03 PM
Yes, I think that should be sufficient. You can set it higher if you see issues (like container getting killed by YARN) but 12G definitely seems excessive for 32Gb Xmx. I'll try to find out why the engine would recommend such high values.
Created on 01-07-2019 12:00 PM
Brilliant article. One suggestion. The example you gave in "Configure interactive query" isn't matching with example you discussed. Can you please modify it.