Impala Queries Out-of-Memory Stability Issues

CDH 5.15.0

CentOS 6.10 final

 

Hey, 

 

My internal team members have been using an enterprise CM environment installed on a cluster with more-than-adequate hardware (it used to be customer-facing and handled multiple large queries at once), but they have been complaining that their scripts and queries fail inconsistently with out-of-memory errors. This happens both for users with memory limits and for users with free access to the entire cluster's resources. An example output is shown below. Is this a known issue with the current CDH version? I'm raising this because the cluster used to run smoothly under much heavier query load and concurrency, and now it seems to be a roll of the dice every time a non-tiny query is run.

 

Memory limit exceeded: Error occurred on backend <hostname> by fragment b84dc213ea94e53d:a98ab78000000ad
Memory left in process limit: 125.63 GB
Memory left in query limit: -130.89 KB
Query(b84dc213ea94e53d:a98ab7800000000): memory limit exceeded. Limit=1.00 GB Reservation=441.88 MB ReservationLimit=819.20 MB OtherMemory=582.25 MB Total=1.00 GB Peak=1.00 GB
  Unclaimed reservations: Reservation=112.00 MB OtherMemory=0 Total=112.00 MB Peak=237.75 MB
  Fragment b84dc213ea94e53d:a98ab7800000141: Reservation=0 OtherMemory=57.64 KB Total=57.64 KB Peak=1.57 MB
    AGGREGATION_NODE (id=49): Total=42.12 KB Peak=42.12 KB
      Exprs: Total=42.12 KB Peak=42.12 KB
    EXCHANGE_NODE (id=48): Reservation=0 OtherMemory=0 Total=0 Peak=0
      DataStreamRecvr: Total=0 Peak=0
    DataStreamSender (dst_id=50): Total=424.00 B Peak=424.00 B
    CodeGen: Total=7.10 KB Peak=1.52 MB
  Fragment b84dc213ea94e53d:a98ab7800000122: Reservation=0 OtherMemory=10.32 MB Total=10.32 MB Peak=14.39 MB
    AGGREGATION_NODE (id=32): Total=42.12 KB Peak=42.12 KB
      Exprs: Total=42.12 KB Peak=42.12 KB
    HASH_JOIN_NODE (id=31): Total=142.25 KB Peak=142.25 KB
      Exprs: Total=31.12 KB Peak=31.12 KB
      Hash Join Builder (join_node_id=31): Total=31.12 KB Peak=31.12 KB
        Hash Join Builder (join_node_id=31) Exprs: Total=31.12 KB Peak=31.12 KB
    EXCHANGE_NODE (id=46): Reservation=0 OtherMemory=10.09 MB Total=10.09 MB Peak=10.09 MB
      DataStreamRecvr: Total=10.09 MB Peak=10.09 MB
    EXCHANGE_NODE (id=47): Reservation=0 OtherMemory=0 Total=0 Peak=0
      DataStreamRecvr: Total=0 Peak=0
    DataStreamSender (dst_id=48): Total=12.11 KB Peak=12.11 KB
    CodeGen: Total=31.34 KB Peak=4.59 MB
  Fragment b84dc213ea94e53d:a98ab780000006b: Reservation=34.00 MB OtherMemory=17.89 MB Total=51.89 MB Peak=51.89 MB
    HASH_JOIN_NODE (id=30): Reservation=34.00 MB OtherMemory=2.60 MB Total=36.60 MB Peak=36.60 MB
      Exprs: Total=43.12 KB Peak=43.12 KB
      Hash Join Builder (join_node_id=30): Total=39.12 KB Peak=63.12 KB
        Hash Join Builder (join_node_id=30) Exprs: Total=39.12 KB Peak=39.12 KB
    EXCHANGE_NODE (id=37): Reservation=0 OtherMemory=10.04 MB Total=10.04 MB Peak=10.04 MB
      DataStreamRecvr: Total=10.04 MB Peak=10.04 MB
    EXCHANGE_NODE (id=38): Reservation=0 OtherMemory=0 Total=0 Peak=1.20 MB
      DataStreamRecvr: Total=0 Peak=1.20 MB
    DataStreamSender (dst_id=46): Total=2.85 MB Peak=3.61 MB
    CodeGen: Total=11.39 KB Peak=1.51 MB
  Fragment b84dc213ea94e53d:a98ab7800000034: Reservation=1.94 MB OtherMemory=409.38 MB Total=411.32 MB Peak=411.32 MB
    HASH_JOIN_NODE (id=29): Reservation=1.94 MB OtherMemory=6.95 MB Total=8.89 MB Peak=12.12 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
      Hash Join Builder (join_node_id=29): Total=21.12 KB Peak=45.12 KB
        Hash Join Builder (join_node_id=29) Exprs: Total=21.12 KB Peak=21.12 KB
    HDFS_SCAN_NODE (id=0): Total=393.15 MB Peak=393.15 MB
      Exprs: Total=4.00 KB Peak=4.00 KB
    EXCHANGE_NODE (id=35): Reservation=0 OtherMemory=0 Total=0 Peak=4.02 KB
      DataStreamRecvr: Total=0 Peak=4.02 KB
    DataStreamSender (dst_id=37): Total=3.03 MB Peak=6.07 MB
      DataStreamSender (dst_id=37) Exprs: Total=4.00 KB Peak=4.00 KB
    CodeGen: Total=12.10 KB Peak=1.76 MB
  Fragment b84dc213ea94e53d:a98ab780000001f: Reservation=0 OtherMemory=0 Total=0 Peak=3.51 MB
    HASH_JOIN_NODE (id=6): Reservation=0 OtherMemory=0 Total=0 Peak=2.02 MB
      Hash Join Builder (join_node_id=6): Total=0 Peak=37.12 KB
    HDFS_SCAN_NODE (id=5): Total=0 Peak=326.00 KB
    EXCHANGE_NODE (id=34): Reservation=0 OtherMemory=0 Total=0 Peak=4.02 KB
      DataStreamRecvr: Total=0 Peak=4.02 KB
    DataStreamSender (dst_id=35): Total=0 Peak=177.28 KB
    CodeGen: Total=0 Peak=1.53 MB
  Fragment b84dc213ea94e53d:a98ab7800000056: Reservation=0 OtherMemory=0 Total=0 Peak=22.23 MB
    SELECT_NODE (id=11): Total=0 Peak=1.02 MB
    ANALYTIC_EVAL_NODE (id=10): Reservation=0 OtherMemory=0 Total=0 Peak=5.54 MB
    ANALYTIC_EVAL_NODE (id=9): Reservation=0 OtherMemory=0 Total=0 Peak=4.53 MB
    SORT_NODE (id=8): Reservation=0 OtherMemory=0 Total=0 Peak=12.12 MB
    EXCHANGE_NODE (id=36): Reservation=0 OtherMemory=0 Total=0 Peak=2.04 MB
      DataStreamRecvr: Total=0 Peak=2.04 MB
    DataStreamSender (dst_id=38): Total=0 Peak=1.02 MB
    CodeGen: Total=0 Peak=1.13 MB
  Fragment b84dc213ea94e53d:a98ab780000004a: Reservation=0 OtherMemory=0 Total=0 Peak=144.13 KB
    HDFS_SCAN_NODE (id=7): Total=0 Peak=109.00 KB
    DataStreamSender (dst_id=36): Total=0 Peak=30.91 KB
    CodeGen: Total=0 Peak=52.50 KB
  Fragment b84dc213ea94e53d:a98ab7800000103: Reservation=258.00 MB OtherMemory=1.68 MB Total=259.68 MB Peak=259.68 MB
    SELECT_NODE (id=28): Total=4.00 KB Peak=4.00 KB
      Exprs: Total=4.00 KB Peak=4.00 KB
    ANALYTIC_EVAL_NODE (id=27): Total=4.00 KB Peak=4.00 KB
      Exprs: Total=4.00 KB Peak=4.00 KB
    ANALYTIC_EVAL_NODE (id=26): Total=4.00 KB Peak=4.00 KB
      Exprs: Total=4.00 KB Peak=4.00 KB
    SORT_NODE (id=25): Reservation=258.00 MB OtherMemory=293.67 KB Total=258.29 MB Peak=258.29 MB
    EXCHANGE_NODE (id=45): Reservation=0 OtherMemory=1.33 MB Total=1.33 MB Peak=10.01 MB
      DataStreamRecvr: Total=1.35 MB Peak=10.01 MB
    DataStreamSender (dst_id=47): Total=49.41 KB Peak=49.41 KB
    CodeGen: Total=3.51 KB Peak=1.03 MB
  Fragment b84dc213ea94e53d:a98ab78000000e4: Reservation=34.00 MB OtherMemory=2.40 MB Total=36.40 MB Peak=44.01 MB
    HASH_JOIN_NODE (id=24): Reservation=34.00 MB OtherMemory=355.95 KB Total=34.35 MB Peak=34.35 MB
      Exprs: Total=43.12 KB Peak=43.12 KB
      Hash Join Builder (join_node_id=24): Total=39.12 KB Peak=55.12 KB
        Hash Join Builder (join_node_id=24) Exprs: Total=39.12 KB Peak=39.12 KB
    EXCHANGE_NODE (id=43): Reservation=0 OtherMemory=1.12 MB Total=1.12 MB Peak=8.75 MB
      DataStreamRecvr: Total=1.12 MB Peak=8.75 MB
    EXCHANGE_NODE (id=44): Reservation=0 OtherMemory=0 Total=0 Peak=821.20 KB
      DataStreamRecvr: Total=0 Peak=821.20 KB
    DataStreamSender (dst_id=45): Total=669.34 KB Peak=789.34 KB
      DataStreamSender (dst_id=45) Exprs: Total=8.00 KB Peak=8.00 KB
    CodeGen: Total=11.46 KB Peak=1.53 MB
  Fragment b84dc213ea94e53d:a98ab78000000ad: Reservation=1.94 MB OtherMemory=140.59 MB Total=142.53 MB Peak=178.54 MB
    HASH_JOIN_NODE (id=23): Reservation=1.94 MB OtherMemory=1.12 MB Total=3.05 MB Peak=4.28 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
      Hash Join Builder (join_node_id=23): Total=21.12 KB Peak=45.12 KB
        Hash Join Builder (join_node_id=23) Exprs: Total=21.12 KB Peak=21.12 KB
    HDFS_SCAN_NODE (id=12): Total=137.83 MB Peak=174.28 MB
      Exprs: Total=4.00 KB Peak=4.00 KB
    EXCHANGE_NODE (id=41): Reservation=0 OtherMemory=0 Total=0 Peak=4.02 KB
      DataStreamRecvr: Total=0 Peak=4.02 KB
    DataStreamSender (dst_id=43): Total=643.78 KB Peak=971.78 KB
      DataStreamSender (dst_id=43) Exprs: Total=4.00 KB Peak=4.00 KB
    CodeGen: Total=12.06 KB Peak=1.73 MB
  Fragment b84dc213ea94e53d:a98ab7800000098: Reservation=0 OtherMemory=0 Total=0 Peak=3.40 MB
    HASH_JOIN_NODE (id=18): Reservation=0 OtherMemory=0 Total=0 Peak=2.02 MB
      Hash Join Builder (join_node_id=18): Total=0 Peak=37.12 KB
    HDFS_SCAN_NODE (id=17): Total=0 Peak=210.00 KB
    EXCHANGE_NODE (id=40): Reservation=0 OtherMemory=0 Total=0 Peak=4.02 KB
      DataStreamRecvr: Total=0 Peak=4.02 KB
    DataStreamSender (dst_id=41): Total=0 Peak=177.28 KB
    CodeGen: Total=0 Peak=1.53 MB
  Fragment b84dc213ea94e53d:a98ab78000000cf: Reservation=0 OtherMemory=0 Total=0 Peak=18.08 MB
    SELECT_NODE (id=22): Total=0 Peak=528.00 KB
    ANALYTIC_EVAL_NODE (id=21): Reservation=0 OtherMemory=0 Total=0 Peak=4.53 MB
    SORT_NODE (id=20): Reservation=0 OtherMemory=0 Total=0 Peak=12.10 MB
    EXCHANGE_NODE (id=42): Reservation=0 OtherMemory=0 Total=0 Peak=1.37 MB
      DataStreamRecvr: Total=0 Peak=1.37 MB
    DataStreamSender (dst_id=44): Total=0 Peak=1.52 MB
    CodeGen: Total=0 Peak=876.00 KB
  Fragment b84dc213ea94e53d:a98ab78000000c3: Reservation=0 OtherMemory=0 Total=0 Peak=126.65 KB
    HDFS_SCAN_NODE (id=19): Total=0 Peak=81.02 KB
    DataStreamSender (dst_id=42): Total=0 Peak=41.41 KB
    CodeGen: Total=0 Peak=52.50 KB
1 REPLY

It looks like there was plenty of memory available in the system (about 125 GB left in the process limit); that query just hit its individual 1 GB memory limit.
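
As a rough cross-check of the query-level numbers in your dump, here is a small Python sketch. It assumes the printed sizes use binary units (1 MB = 1024 KB) and that the limit is exactly 1 GB rather than a rounded value; `to_bytes` and the variable names are mine for illustration, not anything from Impala.

```python
import re

# Assumption: sizes in the dump use binary units (1 KB = 1024 B).
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(text):
    """Convert a size such as '441.88 MB' (or a bare '0') to bytes."""
    match = re.fullmatch(r"([\d.]+)(?: (B|KB|MB|GB))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    value, unit = match.groups()
    return float(value) * UNITS[unit or "B"]


# Query-level summary line copied from the error message.
summary = (
    "Query(b84dc213ea94e53d:a98ab7800000000): memory limit exceeded. "
    "Limit=1.00 GB Reservation=441.88 MB ReservationLimit=819.20 MB "
    "OtherMemory=582.25 MB Total=1.00 GB Peak=1.00 GB"
)

# Collect every Key=Value pair on the line and convert the values to bytes.
pairs = re.findall(r"(\w+)=([\d.]+(?: [KMG]?B)?)", summary)
fields = {key: to_bytes(value) for key, value in pairs}

# Reserved (buffer pool) memory plus everything else the query holds.
used = fields["Reservation"] + fields["OtherMemory"]
headroom = fields["Limit"] - used

print(f"used     = {used / UNITS['MB']:.2f} MB")
print(f"headroom = {headroom / UNITS['KB']:.2f} KB")
```

Under those assumptions the reservation plus other memory comes to slightly more than 1024 MB, so the headroom is negative by roughly 130 KB. That agrees with the "Memory left in query limit: -130.89 KB" line to within the rounding of the printed values, while the process as a whole still had over 125 GB free.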

 

There were a lot of improvements to avoid out-of-memory errors between 5.15 and 6.1, particularly for queries with a lot of scans that use a significant amount of memory. It looks like one of the scans was using a large chunk of the query memory:

 

    HDFS_SCAN_NODE (id=0): Total=393.15 MB Peak=393.15 MB
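
If you want to rank the operators in a dump like this yourself, a minimal Python sketch follows. The five lines are copied from your output (they are the largest operator totals I could find in it); `top_consumers` and `to_bytes` are hypothetical helper names, and the sketch assumes binary units and an exact 1 GB limit.

```python
import re

# Assumption: sizes in the dump use binary units (1 KB = 1024 B).
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
LIMIT = 1 * UNITS["GB"]  # the query's Limit=1.00 GB, assumed exact

# Operator lines copied from the memory dump in the question.
DUMP_EXCERPT = """\
    HASH_JOIN_NODE (id=30): Reservation=34.00 MB OtherMemory=2.60 MB Total=36.60 MB Peak=36.60 MB
    HDFS_SCAN_NODE (id=0): Total=393.15 MB Peak=393.15 MB
    SORT_NODE (id=25): Reservation=258.00 MB OtherMemory=293.67 KB Total=258.29 MB Peak=258.29 MB
    HASH_JOIN_NODE (id=24): Reservation=34.00 MB OtherMemory=355.95 KB Total=34.35 MB Peak=34.35 MB
    HDFS_SCAN_NODE (id=12): Total=137.83 MB Peak=174.28 MB
"""


def to_bytes(text):
    """Convert a size such as '393.15 MB' (or a bare '0') to bytes."""
    match = re.fullmatch(r"([\d.]+)(?: (B|KB|MB|GB))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    value, unit = match.groups()
    return float(value) * UNITS[unit or "B"]


def top_consumers(dump, n=3):
    """Return the n operators with the largest current Total as (bytes, name)."""
    pattern = re.compile(r"\s*(\w+_NODE \(id=\d+\)):.*?Total=([\d.]+(?: [KMG]?B)?)")
    rows = []
    for line in dump.splitlines():
        match = pattern.match(line)
        if match:
            rows.append((to_bytes(match.group(2)), match.group(1)))
    return sorted(rows, reverse=True)[:n]


top = top_consumers(DUMP_EXCERPT)
for size, name in top:
    print(f"{name}: {size / UNITS['MB']:.2f} MB ({size / LIMIT:.0%} of the limit)")
```

On this excerpt the scan with id=0 comes out on top at a bit over a third of the 1 GB limit, followed by the sort (id=25) and the second scan (id=12).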

 

There's one specific regression that I'm aware of that affected Avro scans: https://issues.apache.org/jira/browse/IMPALA-7078. The fix is in 5.15.1 and 5.15.2. I don't know which file format your tables use, but thought I'd flag it. The IMPALA-7078 fix also included a few tweaks that would benefit all file formats.

 

So I'd suggest:

  • Give the queries a bit more memory - in practice we've seen 2 GB work a lot better across a wider variety of queries in CDH 5.x. 1 GB is a bit squeezy for a query with 49 operators.
  • Pick up the 5.15.2 or 5.16.2 maintenance releases to get the fix for IMPALA-7078 - that may be enough to solve the problem.
  • Look at CDH 6.1, which addresses a bunch of issues in this area more systematically - it moves the scan operations to a much more robust memory throttling/reservation system (I spent a bunch of time last year working on problems in this general area).

1 GB might just not be enough to run a query with that many operators on the version of Impala that you're running.