04-26-2016 07:31 PM - last edited on 11-10-2016 05:20 AM by cjervis
i use three node with 64 memory to run join operation. But the the memory of one node is exceed no matter i set the mem_limit properties with the values of -1B,60G,50G or 40G. While other nodes only use 10G+ memory. Here is the part of profiles:
Query (id=6043a6bbde2f997a:3b3072b5dcfd1d89): Summary: Session ID: 4141715fae32f3b5:5d20c4f989cea7b8 Session Type: BEESWAX Start Time: 2016-04-26 18:11:31.958572000 End Time: 2016-04-27 04:19:46.109358000 Query Type: DML Query State: EXCEPTION Query Status: Memory limit exceeded Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 8. Number of rows 2052516353. Impala Version: impalad version 2.3.0-cdh5.5.0 RELEASE (build 0c891d79aa38f297d244855a32f1e17280e2129b) User: root Connected User: root Delegated User: Network Address: ::ffff:192.168.55.247:40383 Default Db: default Sql Statement: insert into table result select parquet_bigdatabench_dw_order_300g.buyer_id,sum(parquet_bigdatabench_dw_item_300g.goods_amount) as total from parquet_bigdatabench_dw_order_300g join [shuffle] parquet_bigdatabench_dw_item_300g on parquet_bigdatabench_dw_item_300g.order_id = parquet_bigdatabench_dw_order_300g.order_id group by parquet_bigdatabench_dw_order_300g.buyer_id limit 10 Coordinator: bigdata3:22000 Plan: ---------------- Estimated Per-Host Requirements: Memory=39.27GB VCores=2 F04:PLAN FRAGMENT [UNPARTITIONED] WRITE TO HDFS [default.result, OVERWRITE=false] | partitions=1 | hosts=1 per-host-mem=unavailable | 08:EXCHANGE [UNPARTITIONED] limit: 10 hosts=3 per-host-mem=unavailable tuple-ids=2 row-size=12B cardinality=10 F03:PLAN FRAGMENT [HASH(parquet_bigdatabench_dw_order_300g.buyer_id)] DATASTREAM SINK [FRAGMENT=F04, EXCHANGE=08, UNPARTITIONED] 07:AGGREGATE [FINALIZE] | output: sum:merge(parquet_bigdatabench_dw_item_300g.goods_amount) | group by: parquet_bigdatabench_dw_order_300g.buyer_id | limit: 10 | hosts=3 per-host-mem=9.21GB | tuple-ids=2 row-size=12B cardinality=2247426048 | 06:EXCHANGE [HASH(parquet_bigdatabench_dw_order_300g.buyer_id)] hosts=3 per-host-mem=0B tuple-ids=2 row-size=12B cardinality=2247426048 F02:PLAN FRAGMENT [HASH(parquet_bigdatabench_dw_item_300g.order_id)] DATASTREAM SINK [FRAGMENT=F03, EXCHANGE=06, HASH(parquet_bigdatabench_dw_order_300g.buyer_id)] 03:AGGREGATE | output: sum(parquet_bigdatabench_dw_item_300g.goods_amount) | group by: parquet_bigdatabench_dw_order_300g.buyer_id | hosts=3 per-host-mem=27.63GB | tuple-ids=2 row-size=12B cardinality=2247426048 | 02:HASH JOIN [INNER JOIN, PARTITIONED] | hash predicates: parquet_bigdatabench_dw_item_300g.order_id = parquet_bigdatabench_dw_order_300g.order_id | hosts=3 per-host-mem=11.47GB | tuple-ids=1,0 row-size=20B cardinality=4103971316 | |--05:EXCHANGE [HASH(parquet_bigdatabench_dw_order_300g.order_id)] | hosts=3 per-host-mem=0B | tuple-ids=0 row-size=8B cardinality=4200000000 | 04:EXCHANGE [HASH(parquet_bigdatabench_dw_item_300g.order_id)] hosts=3 per-host-mem=0B tuple-ids=1 row-size=12B cardinality=4200000000 F01:PLAN FRAGMENT [RANDOM] DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=05, HASH(parquet_bigdatabench_dw_order_300g.order_id)] 00:SCAN HDFS [default.parquet_bigdatabench_dw_order_300g, RANDOM] partitions=1/1 files=87 size=21.15GB table stats: 4200000000 rows total column stats: all hosts=3 per-host-mem=176.00MB tuple-ids=0 row-size=8B cardinality=4200000000 F00:PLAN FRAGMENT [RANDOM] DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=04, HASH(parquet_bigdatabench_dw_item_300g.order_id)] 01:SCAN HDFS [default.parquet_bigdatabench_dw_item_300g, RANDOM] partitions=1/1 files=258 size=63.82GB table stats: 4200000000 rows total column stats: all hosts=3 per-host-mem=176.00MB tuple-ids=1 row-size=12B cardinality=4200000000 ---------------- Estimated Per-Host Mem: 42170573209 Estimated Per-Host VCores: 2 Admission result: Admitted immediately Request Pool: root.root ExecSummary: Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail --------------------------------------------------------------------------------------------------------------------------- 08:EXCHANGE 1 10h8m 10h8m 0 10 0 -1.00 B UNPARTITIONED 07:AGGREGATE 3 135.686ms 407.60ms 0 2.25B 155.03 MB 9.21 GB FINALIZE 06:EXCHANGE 3 15.975ms 24.767ms 1.20M 2.25B 0 0 HASH(parquet_bigdatabench_d... 03:AGGREGATE 3 887.849ms 1s340ms 1.20M 2.25B 155.02 MB 27.63 GB 02:HASH JOIN 3 3h19m 9h53m 1.50M 4.10B 31.46 GB 11.47 GB INNER JOIN, PARTITIONED |--05:EXCHANGE 3 1m2s 2m5s 4.20B 4.20B 0 0 HASH(parquet_bigdatabench_d... | 00:SCAN HDFS 3 12s695ms 16s494ms 4.20B 4.20B 485.76 MB 176.00 MB default.parquet_bigdatabenc... 04:EXCHANGE 3 59s722ms 2m59s 4.20B 4.20B 0 0 HASH(parquet_bigdatabench_d... 01:SCAN HDFS 3 14s341ms 19s831ms 4.20B 4.20B 205.20 MB 176.00 MB default.parquet_bigdatabenc...
04-27-2016 04:00 PM
It looks like you have extreme skew in the key that you're joining on (~2 billion duplicates). The error message is:
"Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 8. Number of rows 2052516353."
In order for the hash join to join on a key all the values for that key on the right side of the join need to be able to fit in memory. Impala has several ways to avoid this problem, but it looks like you defeated them all.
First, it tries to put the smaller input on the right side of the join, but both your inputs are the same size, so that doesn't help.
Second, it will try to spill some of the rows to disk and process a subset of those.
Third, it will try to repeatedly split ("repartition") the right-side input based on the join key to try and get it small enough to fit in memory. Based on the error message, it tried to do that 8 times but still has 2 billion rows in one of the partitions, which probably means there are 2 billion rows with the same key.
04-28-2016 02:43 AM - edited 04-28-2016 03:14 AM
Tim,thank you for reply, yes ,I see i really have the data skew.But how can i to avoid it? By the way, how can i get the details of the partitioned join?
04-28-2016 08:48 AM
It looks like it's some kind of benchmark data? Is it a publicly available benchmark. It seems strange that they would have such skewed data.
I think even if it did stay in memory it would run for an incredibly long time: if 2 billion keys in one table get matches to 2 billion keys in another table then you will get 10^18 output rows (over a quintillion). So I think either there is something strange with the benchmark data or it doesn't make sense to join the tables.
We don't have the exact join algorithm documented aside from in the Impala code. It's a version of the Hybrid hash join https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
04-28-2016 08:50 AM
The algorithm is described in the Impala source code here if you (or anyone else reading) is interested: