Created on 10-23-2019 11:23 AM - edited 10-23-2019 11:25 AM
CDH 5.15.0
CentOS 6.10 final
Hey,
My internal team has been using an enterprise CM environment installed on a cluster with more than adequate hardware (it used to be customer-facing and handled multiple large queries at once), but they have been complaining that their scripts and queries fail intermittently with out-of-memory errors. This happens both for users with memory limits and for users with free access to the entire cluster's resources. An example of the output is shown below. Is this a known issue with the current CDH version? I'm raising this because the cluster used to run smoothly under much heavier query load and concurrency, and now it seems to be a roll of the dice every time a non-tiny query is run.
Memory limit exceeded: Error occurred on backend <hostname> by fragment b84dc213ea94e53d:a98ab78000000ad
Memory left in process limit: 125.63 GB
Memory left in query limit: -130.89 KB
Query(b84dc213ea94e53d:a98ab7800000000): memory limit exceeded. Limit=1.00 GB Reservation=441.88 MB ReservationLimit=819.20 MB OtherMemory=582.25 MB Total=1.00 GB Peak=1.00 GB
  Unclaimed reservations: Reservation=112.00 MB OtherMemory=0 Total=112.00 MB Peak=237.75 MB
  Fragment b84dc213ea94e53d:a98ab7800000141: Reservation=0 OtherMemory=57.64 KB Total=57.64 KB Peak=1.57 MB
    AGGREGATION_NODE (id=49): Total=42.12 KB Peak=42.12 KB
      Exprs: Total=42.12 KB Peak=42.12 KB
    EXCHANGE_NODE (id=48): Reservation=0 OtherMemory=0 Total=0 Peak=0
      DataStreamRecvr: Total=0 Peak=0
    DataStreamSender (dst_id=50): Total=424.00 B Peak=424.00 B
    CodeGen: Total=7.10 KB Peak=1.52 MB
  Fragment b84dc213ea94e53d:a98ab7800000122: Reservation=0 OtherMemory=10.32 MB Total=10.32 MB Peak=14.39 MB
    AGGREGATION_NODE (id=32): Total=42.12 KB Peak=42.12 KB
      Exprs: Total=42.12 KB Peak=42.12 KB
    HASH_JOIN_NODE (id=31): Total=142.25 KB Peak=142.25 KB
      Exprs: Total=31.12 KB Peak=31.12 KB
      Hash Join Builder (join_node_id=31): Total=31.12 KB Peak=31.12 KB
        Hash Join Builder (join_node_id=31) Exprs: Total=31.12 KB Peak=31.12 KB
    EXCHANGE_NODE (id=46): Reservation=0 OtherMemory=10.09 MB Total=10.09 MB Peak=10.09 MB
      DataStreamRecvr: Total=10.09 MB Peak=10.09 MB
    EXCHANGE_NODE (id=47): Reservation=0 OtherMemory=0 Total=0 Peak=0
      DataStreamRecvr: Total=0 Peak=0
    DataStreamSender (dst_id=48): Total=12.11 KB Peak=12.11 KB
    CodeGen: Total=31.34 KB Peak=4.59 MB
  Fragment b84dc213ea94e53d:a98ab780000006b: Reservation=34.00 MB OtherMemory=17.89 MB Total=51.89 MB Peak=51.89 MB
    HASH_JOIN_NODE (id=30): Reservation=34.00 MB OtherMemory=2.60 MB Total=36.60 MB Peak=36.60 MB
      Exprs: Total=43.12 KB Peak=43.12 KB
      Hash Join Builder (join_node_id=30): Total=39.12 KB Peak=63.12 KB
        Hash Join Builder (join_node_id=30) Exprs: Total=39.12 KB Peak=39.12 KB
    EXCHANGE_NODE (id=37): Reservation=0 OtherMemory=10.04 MB Total=10.04 MB Peak=10.04 MB
      DataStreamRecvr: Total=10.04 MB Peak=10.04 MB
    EXCHANGE_NODE (id=38): Reservation=0 OtherMemory=0 Total=0 Peak=1.20 MB
      DataStreamRecvr: Total=0 Peak=1.20 MB
    DataStreamSender (dst_id=46): Total=2.85 MB Peak=3.61 MB
    CodeGen: Total=11.39 KB Peak=1.51 MB
  Fragment b84dc213ea94e53d:a98ab7800000034: Reservation=1.94 MB OtherMemory=409.38 MB Total=411.32 MB Peak=411.32 MB
    HASH_JOIN_NODE (id=29): Reservation=1.94 MB OtherMemory=6.95 MB Total=8.89 MB Peak=12.12 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
      Hash Join Builder (join_node_id=29): Total=21.12 KB Peak=45.12 KB
        Hash Join Builder (join_node_id=29) Exprs: Total=21.12 KB Peak=21.12 KB
    HDFS_SCAN_NODE (id=0): Total=393.15 MB Peak=393.15 MB
      Exprs: Total=4.00 KB Peak=4.00 KB
    EXCHANGE_NODE (id=35): Reservation=0 OtherMemory=0 Total=0 Peak=4.02 KB
      DataStreamRecvr: Total=0 Peak=4.02 KB
    DataStreamSender (dst_id=37): Total=3.03 MB Peak=6.07 MB
      DataStreamSender (dst_id=37) Exprs: Total=4.00 KB Peak=4.00 KB
    CodeGen: Total=12.10 KB Peak=1.76 MB
  Fragment b84dc213ea94e53d:a98ab780000001f: Reservation=0 OtherMemory=0 Total=0 Peak=3.51 MB
    HASH_JOIN_NODE (id=6): Reservation=0 OtherMemory=0 Total=0 Peak=2.02 MB
      Hash Join Builder (join_node_id=6): Total=0 Peak=37.12 KB
    HDFS_SCAN_NODE (id=5): Total=0 Peak=326.00 KB
    EXCHANGE_NODE (id=34): Reservation=0 OtherMemory=0 Total=0 Peak=4.02 KB
      DataStreamRecvr: Total=0 Peak=4.02 KB
    DataStreamSender (dst_id=35): Total=0 Peak=177.28 KB
    CodeGen: Total=0 Peak=1.53 MB
  Fragment b84dc213ea94e53d:a98ab7800000056: Reservation=0 OtherMemory=0 Total=0 Peak=22.23 MB
    SELECT_NODE (id=11): Total=0 Peak=1.02 MB
    ANALYTIC_EVAL_NODE (id=10): Reservation=0 OtherMemory=0 Total=0 Peak=5.54 MB
    ANALYTIC_EVAL_NODE (id=9): Reservation=0 OtherMemory=0 Total=0 Peak=4.53 MB
    SORT_NODE (id=8): Reservation=0 OtherMemory=0 Total=0 Peak=12.12 MB
    EXCHANGE_NODE (id=36): Reservation=0 OtherMemory=0 Total=0 Peak=2.04 MB
      DataStreamRecvr: Total=0 Peak=2.04 MB
    DataStreamSender (dst_id=38): Total=0 Peak=1.02 MB
    CodeGen: Total=0 Peak=1.13 MB
  Fragment b84dc213ea94e53d:a98ab780000004a: Reservation=0 OtherMemory=0 Total=0 Peak=144.13 KB
    HDFS_SCAN_NODE (id=7): Total=0 Peak=109.00 KB
    DataStreamSender (dst_id=36): Total=0 Peak=30.91 KB
    CodeGen: Total=0 Peak=52.50 KB
  Fragment b84dc213ea94e53d:a98ab7800000103: Reservation=258.00 MB OtherMemory=1.68 MB Total=259.68 MB Peak=259.68 MB
    SELECT_NODE (id=28): Total=4.00 KB Peak=4.00 KB
      Exprs: Total=4.00 KB Peak=4.00 KB
    ANALYTIC_EVAL_NODE (id=27): Total=4.00 KB Peak=4.00 KB
      Exprs: Total=4.00 KB Peak=4.00 KB
    ANALYTIC_EVAL_NODE (id=26): Total=4.00 KB Peak=4.00 KB
      Exprs: Total=4.00 KB Peak=4.00 KB
    SORT_NODE (id=25): Reservation=258.00 MB OtherMemory=293.67 KB Total=258.29 MB Peak=258.29 MB
    EXCHANGE_NODE (id=45): Reservation=0 OtherMemory=1.33 MB Total=1.33 MB Peak=10.01 MB
      DataStreamRecvr: Total=1.35 MB Peak=10.01 MB
    DataStreamSender (dst_id=47): Total=49.41 KB Peak=49.41 KB
    CodeGen: Total=3.51 KB Peak=1.03 MB
  Fragment b84dc213ea94e53d:a98ab78000000e4: Reservation=34.00 MB OtherMemory=2.40 MB Total=36.40 MB Peak=44.01 MB
    HASH_JOIN_NODE (id=24): Reservation=34.00 MB OtherMemory=355.95 KB Total=34.35 MB Peak=34.35 MB
      Exprs: Total=43.12 KB Peak=43.12 KB
      Hash Join Builder (join_node_id=24): Total=39.12 KB Peak=55.12 KB
        Hash Join Builder (join_node_id=24) Exprs: Total=39.12 KB Peak=39.12 KB
    EXCHANGE_NODE (id=43): Reservation=0 OtherMemory=1.12 MB Total=1.12 MB Peak=8.75 MB
      DataStreamRecvr: Total=1.12 MB Peak=8.75 MB
    EXCHANGE_NODE (id=44): Reservation=0 OtherMemory=0 Total=0 Peak=821.20 KB
      DataStreamRecvr: Total=0 Peak=821.20 KB
    DataStreamSender (dst_id=45): Total=669.34 KB Peak=789.34 KB
      DataStreamSender (dst_id=45) Exprs: Total=8.00 KB Peak=8.00 KB
    CodeGen: Total=11.46 KB Peak=1.53 MB
  Fragment b84dc213ea94e53d:a98ab78000000ad: Reservation=1.94 MB OtherMemory=140.59 MB Total=142.53 MB Peak=178.54 MB
    HASH_JOIN_NODE (id=23): Reservation=1.94 MB OtherMemory=1.12 MB Total=3.05 MB Peak=4.28 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
      Hash Join Builder (join_node_id=23): Total=21.12 KB Peak=45.12 KB
        Hash Join Builder (join_node_id=23) Exprs: Total=21.12 KB Peak=21.12 KB
    HDFS_SCAN_NODE (id=12): Total=137.83 MB Peak=174.28 MB
      Exprs: Total=4.00 KB Peak=4.00 KB
    EXCHANGE_NODE (id=41): Reservation=0 OtherMemory=0 Total=0 Peak=4.02 KB
      DataStreamRecvr: Total=0 Peak=4.02 KB
    DataStreamSender (dst_id=43): Total=643.78 KB Peak=971.78 KB
      DataStreamSender (dst_id=43) Exprs: Total=4.00 KB Peak=4.00 KB
    CodeGen: Total=12.06 KB Peak=1.73 MB
  Fragment b84dc213ea94e53d:a98ab7800000098: Reservation=0 OtherMemory=0 Total=0 Peak=3.40 MB
    HASH_JOIN_NODE (id=18): Reservation=0 OtherMemory=0 Total=0 Peak=2.02 MB
      Hash Join Builder (join_node_id=18): Total=0 Peak=37.12 KB
    HDFS_SCAN_NODE (id=17): Total=0 Peak=210.00 KB
    EXCHANGE_NODE (id=40): Reservation=0 OtherMemory=0 Total=0 Peak=4.02 KB
      DataStreamRecvr: Total=0 Peak=4.02 KB
    DataStreamSender (dst_id=41): Total=0 Peak=177.28 KB
    CodeGen: Total=0 Peak=1.53 MB
  Fragment b84dc213ea94e53d:a98ab78000000cf: Reservation=0 OtherMemory=0 Total=0 Peak=18.08 MB
    SELECT_NODE (id=22): Total=0 Peak=528.00 KB
    ANALYTIC_EVAL_NODE (id=21): Reservation=0 OtherMemory=0 Total=0 Peak=4.53 MB
    SORT_NODE (id=20): Reservation=0 OtherMemory=0 Total=0 Peak=12.10 MB
    EXCHANGE_NODE (id=42): Reservation=0 OtherMemory=0 Total=0 Peak=1.37 MB
      DataStreamRecvr: Total=0 Peak=1.37 MB
    DataStreamSender (dst_id=44): Total=0 Peak=1.52 MB
    CodeGen: Total=0 Peak=876.00 KB
  Fragment b84dc213ea94e53d:a98ab78000000c3: Reservation=0 OtherMemory=0 Total=0 Peak=126.65 KB
    HDFS_SCAN_NODE (id=19): Total=0 Peak=81.02 KB
    DataStreamSender (dst_id=42): Total=0 Peak=41.41 KB
    CodeGen: Total=0 Peak=52.50 KB
Created on 10-23-2019 02:27 PM - edited 10-23-2019 02:28 PM
It looks like there was plenty of memory available in the system (125.63 GB left in the process limit); that query just hit its individual 1.00 GB memory limit.
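A quick arithmetic check of the query-level numbers in the posted error supports this reading. This is only a sketch: the values are copied from the "Query(...)" line of the output, the variable and function names are mine, and I'm assuming the dump reports 1024-based units.

```python
# Values copied from the posted error output; names are illustrative only.
MB = 1024 ** 2  # assumption: the dump uses 1024-based units

limit_bytes = 1.00 * 1024 * MB     # Limit=1.00 GB
reservation_bytes = 441.88 * MB    # Reservation=441.88 MB
other_bytes = 582.25 * MB          # OtherMemory=582.25 MB
process_left_gb = 125.63           # Memory left in process limit: 125.63 GB


def query_headroom_kb():
    """Return what is left of the per-query limit in KB (negative means over)."""
    used = reservation_bytes + other_bytes
    return (limit_bytes - used) / 1024


if __name__ == "__main__":
    print(f"query headroom:   {query_headroom_kb():.2f} KB")
    print(f"process headroom: {process_left_gb} GB")
```

The query headroom comes out as a small negative number (roughly -133 KB), which matches the reported -130.89 KB to within the rounding of the printed values, while the process as a whole still had over 125 GB free.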
There were a lot of improvements to avoid out-of-memory errors between 5.15 and 6.1, particularly for queries with many scans that use a significant amount of memory. It looks like one of the scans was using a large chunk of the query memory:
HDFS_SCAN_NODE (id=0): Total=393.15 MB Peak=393.15 MB
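If it helps to spot lines like this in a long dump, here is a rough Python sketch that ranks plan nodes by their current Total. The regular expression and the helper name rank_operators are my own, not part of any Impala tooling; it assumes the "NAME_NODE (id=N): ... Total=X UNIT" layout seen in the error above and 1024-based units, and it works whether or not the dump has line breaks.

```python
import re

# Assumption: sizes in the dump are 1024-based; a bare "0" has no unit.
UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

# Match a plan node and the first Total= that follows it (the node's own total).
NODE_RE = re.compile(
    r"([A-Z_]+_NODE) \(id=(\d+)\):.*?Total=([\d.]+)(?: ([KMG]?B)\b)?",
    re.DOTALL,
)


def rank_operators(dump):
    """Return (bytes, node_name, node_id) tuples, largest Total first."""
    rows = []
    for name, node_id, value, unit in NODE_RE.findall(dump):
        rows.append((float(value) * UNIT_BYTES[unit or "B"], name, int(node_id)))
    return sorted(rows, reverse=True)


# A few entries copied from the posted error, flattened as in the original paste.
SAMPLE = (
    "HASH_JOIN_NODE (id=29): Reservation=1.94 MB OtherMemory=6.95 MB "
    "Total=8.89 MB Peak=12.12 MB Exprs: Total=21.12 KB Peak=21.12 KB "
    "HDFS_SCAN_NODE (id=0): Total=393.15 MB Peak=393.15 MB "
    "SORT_NODE (id=25): Reservation=258.00 MB OtherMemory=293.67 KB "
    "Total=258.29 MB Peak=258.29 MB "
    "EXCHANGE_NODE (id=48): Reservation=0 OtherMemory=0 Total=0 Peak=0"
)

if __name__ == "__main__":
    for size, name, node_id in rank_operators(SAMPLE):
        print(f"{name} (id={node_id}): {size / 1024 ** 2:.2f} MB")
```

On the sample entries the scan with id=0 sorts to the top, followed by the sort node with id=25.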
There's one specific regression that I'm aware of that affected Avro scans: https://issues.apache.org/jira/browse/IMPALA-7078. The fix is in 5.15.1 and 5.15.2. I don't know the file format but thought I'd flag that. The IMPALA-7078 fix actually had a few tweaks that would benefit all file formats too.
So I'd suggest:
1 GB might just not be enough to run a query with that many operators on the version of Impala that you're running.