Created on 12-04-2017 07:56 PM - edited 09-16-2022 01:41 AM
This article is intended for understanding map join memory sizing and related configurations specific to LLAP.
For detailed information about LLAP sizing and setup, please refer to LLAP Sizing and Setup
Executor memory size is shared among several components like join hash table loading, map-side aggregation hash tables, PTF operator buffering, shuffle and sort buffers. We recommend 4Gb for a general case (per LLAP Sizing and Setup), configuring less memory can slow down some queries (e.g. hash joins will turn into sort merge joins) and configuring more memory can be wasteful or lead to GC pressure (although more memory can provide more room for hash joins, it can adversely affect query performance when all/many executors load huge hash tables resulting in severe GC pauses).
LLAP adds 2 new changes for hash table loaders for map joins
The way oversubscription of memory works for map join hash tables is, every executor borrows 20% of hive.auto.convert.join.noconditionaltask.size from self and 3 other executors configurable via hive.llap.mapjoin.memory.oversubscribe.factor and hive.llap.memory.oversubscription.max.executors.per.query respectively.
For example, let’s say hive.auto.convert.join.noconditionaltask.size is set to 1GB, in the oversubscription model, map join conversions will inflate this 1GB no-conditional task size to [1GB + (4 * 0.2 * 1GB)] = 1.8GB, where 4 is self + 3 other executors to borrow memory from and 0.2 is 20% of no-conditional task size. Hive query compiler in LLAP mode will take this oversubscribed memory (1.8GB) into consideration when making decision about map join conversions. As a result in LLAP mode, one can observe more map join conversions because of this new oversubscribed memory of 1.8GB.
The way memory monitoring for hash table loader works is, after loading every 100000 rows (configurable via hive.llap.mapjoin.memory.monitor.check.interval) into hash tables, memory monitor will compute the memory usage of hash tables. If memory usage exceeds certain threshold (2x oversubscribed no-conditional task size = 3.6GB in above example), memory monitor will kill the query as it exceeds its memory usage limits. 2x is inflation factor to accommodate additional java object allocation overhead configurable via hive.hash.table.inflation.factor.
It is recommended not to change the default configs for memory oversubscription and memory monitoring as it can significantly impact current query performance or performance of other queries. Reducing the oversubscription memory fraction or additional executors to borrow, will result in less queries being converted to hash joins which can slow down the query performance. On the other hand, increasing these values will result in increased memory usage during hashtable loading which can lead to severe garbage collection when multiple executors are loading hash tables at the same time and can also lead unfair/inflated memory usage only for subset of executors also leading to degraded query performance for all queries.
As always, it is recommended to keep table, partition and column statistics up-to-date for Hive query compiler to generate optimal query plan. If statistics are not present, in some cases MapJoinMemoryExhaustionError may be thrown like below
Caused by: org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError: VectorMapJoin Hash table loading exceeded memory limits for input: Map 2 numEntries: 102000000 estimatedMemoryUsage: 4294983912 effectiveThreshold: 3664143872 memoryMonitorInfo: { isLlap: true executorsPerNode: 24 maxExecutorsOverSubscribeMemory: 3 memoryOverSubscriptionFactor: 0.20000000298023224 memoryCheckInterval: 100000 noConditionalTaskSize: 1017817776 adjustedNoConditionalTaskSize: 1832071997 hashTableInflationFactor: 2.0 threshold: 3664143872 }
In the above error, we can see that Map 2 is broadcasting its data to hash table loader which failed after loading 102000000 rows for exhausting its memory limit of 3664143872 bytes (2 * [1017817776 + 0.8*1017817776]). This is likely happening because of missing statistics that prevents optimizer from making better decision (in this case MAP 2 could have been a shuffle join with correct statistics).