Executor memory is shared among several components: the join hash table loader, map-side aggregation hash tables, PTF operator buffering, and shuffle and sort buffers. We recommend 4 GB per executor for the general case (per LLAP Sizing and Setup). Configuring less memory can slow down some queries (e.g., hash joins degrade into sort-merge joins), while configuring more can be wasteful or cause GC pressure: although more memory leaves more room for hash joins, query performance can suffer when many or all executors load huge hash tables at the same time, resulting in severe GC pauses.
LLAP introduces two changes to the hash table loaders used by map joins:
Memory oversubscription: each executor can borrow memory from other executors, providing more room for in-memory hash tables and enabling more map join conversions.
Memory monitoring: during hash table loading, a memory monitor tracks the hash table's in-memory size to enforce fair usage. If the monitor finds an executor using memory beyond its limit while loading a hash table, the query fails with a map join memory exhaustion error.
Oversubscription for map join hash tables works as follows: each executor borrows 20% of hive.auto.convert.join.noconditionaltask.size from itself and from 3 other executors. The fraction and the number of executors are configurable via hive.llap.mapjoin.memory.oversubscribe.factor and hive.llap.memory.oversubscription.max.executors.per.query, respectively.
For example, if hive.auto.convert.join.noconditionaltask.size is set to 1 GB, the oversubscription model inflates this 1 GB no-conditional task size to [1 GB + (4 * 0.2 * 1 GB)] = 1.8 GB, where 4 is self plus the 3 other executors to borrow memory from and 0.2 is the 20% fraction of the no-conditional task size. In LLAP mode, the Hive query compiler takes this oversubscribed memory (1.8 GB) into account when deciding on map join conversions. As a result, one can observe more map join conversions in LLAP mode because of this oversubscribed 1.8 GB budget.
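The oversubscription arithmetic above can be sketched as follows. This is a minimal illustration using the default configuration values; the function name is ours, not a Hive API:

```python
def oversubscribed_size(noconditional_task_size,
                        oversubscribe_factor=0.2,
                        max_executors_per_query=3):
    """Memory budget for map join hash tables after oversubscription.

    Borrows `oversubscribe_factor` of the no-conditional task size from
    self plus `max_executors_per_query` other executors (defaults mirror
    hive.llap.mapjoin.memory.oversubscribe.factor and
    hive.llap.memory.oversubscription.max.executors.per.query).
    """
    borrowers = 1 + max_executors_per_query  # self + other executors
    return noconditional_task_size \
        + borrowers * oversubscribe_factor * noconditional_task_size

GB = 1024 ** 3
print(round(oversubscribed_size(1 * GB) / GB, 2))  # -> 1.8
```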
Memory monitoring for the hash table loader works as follows: after every 100,000 rows loaded into the hash tables (configurable via hive.llap.mapjoin.memory.monitor.check.interval), the memory monitor computes the hash tables' memory usage. If usage exceeds the threshold (2x the oversubscribed no-conditional task size, i.e., 3.6 GB in the example above), the monitor kills the query for exceeding its memory limit. The 2x inflation factor accommodates additional Java object allocation overhead and is configurable via hive.hash.table.inflation.factor.
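The monitoring loop can be illustrated with the following sketch. The class and function names here are illustrative, not actual Hive classes, and the memory-usage callback stands in for Hive's internal size estimation:

```python
class MapJoinMemoryExhaustionError(Exception):
    """Raised when the hash table exceeds its memory limit."""


def load_hash_table(rows, memory_usage_fn, oversubscribed_size,
                    inflation_factor=2.0, check_interval=100_000):
    """Load (key, value) rows into a hash table, checking memory usage
    every `check_interval` rows against
    inflation_factor * oversubscribed_size."""
    limit = inflation_factor * oversubscribed_size
    table = {}
    for i, (key, value) in enumerate(rows, start=1):
        table.setdefault(key, []).append(value)
        # Periodic check, mirroring hive.llap.mapjoin.memory.monitor.check.interval
        if i % check_interval == 0 and memory_usage_fn(table) > limit:
            raise MapJoinMemoryExhaustionError(
                f"hash table exceeded {limit:.0f} bytes after {i} rows")
    return table
```

With a small check interval and a tight limit, the loader fails mid-load exactly as the monitor described above would kill the query.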
It is recommended not to change the default configs for memory oversubscription and memory monitoring, as doing so can significantly affect the performance of the current query or of other queries. Reducing the oversubscription fraction or the number of additional executors to borrow from results in fewer queries being converted to hash joins, which can slow down query performance. On the other hand, increasing these values increases memory usage during hash table loading, which can cause severe garbage collection when multiple executors load hash tables at the same time, and can also lead to unfair, inflated memory usage by a subset of executors, degrading query performance for all queries.
As always, it is recommended to keep table, partition, and column statistics up to date so that the Hive query compiler can generate an optimal query plan. If statistics are missing, in some cases a MapJoinMemoryExhaustionError may be thrown, as shown below:
In the error above, we can see that Map 2 broadcasts its data to the hash table loader, which failed after loading 102000000 rows because it exhausted its memory limit of 3664143872 bytes (2 * [1017817776 + 0.8*1017817776]). This is likely caused by missing statistics, which prevent the optimizer from making a better decision (with correct statistics, Map 2 could have been a shuffle join).
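As a quick sanity check of the arithmetic in the error message, the limit follows from the formulas above; the small difference from the exact logged byte count presumably comes from how Hive truncates intermediate values:

```python
# hive.auto.convert.join.noconditionaltask.size from the failing query.
size = 1017817776

# Monitor threshold: inflation factor (2x) times the oversubscribed size,
# i.e. 2 * (size + 4 * 0.2 * size) = 3.6 * size.
limit = 2 * (size + 4 * 0.2 * size)
print(limit)  # close to the logged 3664143872 bytes
```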