
Memory limit exceeded: cannot perform hash join

Contributor

Hello Everyone,

 

I am running a cluster with 2 impalads, with a memory capacity of 60 GB for each impalad. I am using the CDH 5.7 package and managing the cluster with CM. I came across a query that failed with the error below:

 

Memory limit exceeded Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 2. Number of rows 53969358.

 

Query:

 

SELECT `dim_experiment`.`experiment_name` AS `experiment_name`
FROM `gwynniebee_bi`.`fact_recommendation_events` `fact_recommendatio`
  LEFT OUTER JOIN `gwynniebee_bi`.`dim_experiment` `dim_experiment` ON (`fact_recommendatio`.`experiment_key` = `dim_experiment`.`experiment_key`)
GROUP BY 1

 

Profile:

 

----------------
Estimated Per-Host Requirements: Memory=2.05GB VCores=3
WARNING: The following tables are missing relevant table and/or column statistics.
gwynniebee_bi.fact_recommendation_events

09:MERGING-EXCHANGE [UNPARTITIONED]
|  order by: dim_experiment.experiment_name ASC
|  hosts=1 per-host-mem=unavailable
|  tuple-ids=3 row-size=38B cardinality=41
|
04:SORT
|  order by: dim_experiment.experiment_name ASC
|  hosts=1 per-host-mem=16.00MB
|  tuple-ids=3 row-size=38B cardinality=41
|
08:AGGREGATE [FINALIZE]
|  group by: dim_experiment.experiment_name
|  hosts=1 per-host-mem=10.00MB
|  tuple-ids=2 row-size=38B cardinality=41
|
07:EXCHANGE [HASH(dim_experiment.experiment_name)]
|  hosts=1 per-host-mem=0B
|  tuple-ids=2 row-size=38B cardinality=41
|
03:AGGREGATE [STREAMING]
|  group by: dim_experiment.experiment_name
|  hosts=1 per-host-mem=10.00MB
|  tuple-ids=2 row-size=38B cardinality=41
|
02:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]
|  hash predicates: dim_experiment.experiment_key = fact_recommendatio.experiment_key
|  runtime filters: RF000 <- fact_recommendatio.experiment_key
|  hosts=1 per-host-mem=2.00GB
|  tuple-ids=1N,0 row-size=46B cardinality=unavailable
|
|--06:EXCHANGE [HASH(fact_recommendatio.experiment_key)]
|  |  hosts=2 per-host-mem=0B
|  |  tuple-ids=0 row-size=4B cardinality=unavailable
|  |
|  00:SCAN HDFS [gwynniebee_bi.fact_recommendation_events fact_recommendatio, RANDOM]
|     partitions=1/1 files=90 size=3.96GB
|     table stats: unavailable
|     column stats: unavailable
|     hosts=2 per-host-mem=56.00MB
|     tuple-ids=0 row-size=4B cardinality=unavailable
|
05:EXCHANGE [HASH(dim_experiment.experiment_key)]
|  hosts=1 per-host-mem=0B
|  tuple-ids=1 row-size=42B cardinality=78
|
01:SCAN HDFS [gwynniebee_bi.dim_experiment dim_experiment, RANDOM]
   partitions=1/1 files=1 size=10.56KB
   runtime filters: RF000 -> dim_experiment.experiment_key
   table stats: 78 rows total
   column stats: all
   hosts=1 per-host-mem=32.00MB
   tuple-ids=1 row-size=42B cardinality=78


I was trying to correlate this issue with available and in-use memory using CM metrics such as tcmalloc_bytes_in_use and mem_rss. Around the same time, I see very little memory being used by the impalad process.

 

The graph below shows the above metrics around the same time (when the query hit the memory error):

 

impala_mem_usage.png
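
For reference, a CM chart of these metrics can be built with a tsquery along these lines (a rough sketch using the metric names mentioned above; exact metric names can vary between CM versions):

SELECT mem_rss, tcmalloc_bytes_in_use WHERE roleType = IMPALAD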

 

Am I missing any other metric to look for? Please share your thoughts.

 

Thanks,

Mani

4 REPLIES

Super Guru

I can see that the stats for the table "gwynniebee_bi.fact_recommendation_events" are not available:

 

|     table stats: unavailable
|     column stats: unavailable

Without stats, Impala cannot accurately estimate the resources required for the query, and this sometimes leads to OOM errors because memory usage is misestimated up front.

 

If you check the SUMMARY of the query, you will get detailed information about estimated and actual memory usage.
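
For example, in impala-shell right after the query runs (SUMMARY and PROFILE are impala-shell commands, not SQL statements; they report on the most recent query in the session):

SUMMARY;   -- per-operator estimated vs. actual memory and timing
PROFILE;   -- full runtime profile, including per-host peak memory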

 

I suggest you run "COMPUTE STATS gwynniebee_bi.fact_recommendation_events" and try again.
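
Something along these lines (standard Impala statements; the table name is taken from your query):

COMPUTE STATS gwynniebee_bi.fact_recommendation_events;
-- verify that stats are now populated (#Rows should no longer be -1)
SHOW TABLE STATS gwynniebee_bi.fact_recommendation_events;
SHOW COLUMN STATS gwynniebee_bi.fact_recommendation_events;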


Eric's suggestion is the general solution to this problem - without stats Impala is choosing a bad join order and there are a lot of duplicates on the right side of the join.

 

One workaround is to add a straight_join hint, which lets you control the order in which the tables are joined. I believe in your case just adding straight_join will flip the sides of the join, which will almost certainly help you.

 

SELECT STRAIGHT_JOIN `dim_experiment`.`experiment_name` AS `experiment_name`
FROM `gwynniebee_bi`.`fact_recommendation_events` `fact_recommendatio`
  LEFT OUTER JOIN `gwynniebee_bi`.`dim_experiment` `dim_experiment` ON (`fact_recommendatio`.`experiment_key` = `dim_experiment`.`experiment_key`)
GROUP BY 1

 


We are facing a similar issue, and it's a blocker for us now. I've also tried the straight_join hint but am still getting the same error:

 

Query Status: Cannot perform hash join at node with id 20. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 1. Number of rows 185442.


@spurusothaman Usually we go through a couple of steps to troubleshoot issues like this. The two most likely solutions are:

1. Give the query more memory by increasing mem_limit or reducing the number of concurrent queries (see the example below).

2. Adjust the SQL by rewriting the query or adding hints to get a different query plan that avoids having so many duplicate values on the right side of the join.

 

Depending on the exact scenario, the solution might be 1, 2, or both.
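
For option 1, a minimal sketch (MEM_LIMIT is a standard Impala query option; the 10gb value here is only an illustration, pick something that fits your nodes and workload):

SET MEM_LIMIT=10gb;   -- applies to queries run in this session
-- then re-run the failing query in the same session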

 

straight_join is only useful if you use it to force a plan with a different join order. If you want input on whether you have a bad plan and what a better join order might be, please provide a query profile.