Impala memory limit exceeded

Explorer

Hi,

I am running a join operation on three nodes, each with 64 GB of memory. One node exceeds its memory no matter whether I set the mem_limit property to -1B, 60G, 50G, or 40G, while the other nodes use only 10+ GB of memory.
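For reference, setting the limit as a per-query option from impala-shell looks like this (50g is shown only as an example; I tried each of the values above):

-- per-query memory limit; also tried -1B, 60G, and 40G
set mem_limit=50g;

Here is the relevant part of the profile: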

Query (id=6043a6bbde2f997a:3b3072b5dcfd1d89):
  Summary:
    Session ID: 4141715fae32f3b5:5d20c4f989cea7b8
    Session Type: BEESWAX
    Start Time: 2016-04-26 18:11:31.958572000
    End Time: 2016-04-27 04:19:46.109358000
    Query Type: DML
    Query State: EXCEPTION
    Query Status: 
Memory limit exceeded
Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 8. Number of rows 2052516353.

    Impala Version: impalad version 2.3.0-cdh5.5.0 RELEASE (build 0c891d79aa38f297d244855a32f1e17280e2129b)
    User: root
    Connected User: root
    Delegated User: 
    Network Address: ::ffff:192.168.55.247:40383
    Default Db: default
    Sql Statement: insert into table result select parquet_bigdatabench_dw_order_300g.buyer_id,sum(parquet_bigdatabench_dw_item_300g.goods_amount) as total from parquet_bigdatabench_dw_order_300g join [shuffle] parquet_bigdatabench_dw_item_300g on parquet_bigdatabench_dw_item_300g.order_id = parquet_bigdatabench_dw_order_300g.order_id group by parquet_bigdatabench_dw_order_300g.buyer_id limit 10
    Coordinator: bigdata3:22000
    Plan: 
----------------
Estimated Per-Host Requirements: Memory=39.27GB VCores=2

F04:PLAN FRAGMENT [UNPARTITIONED]
  WRITE TO HDFS [default.result, OVERWRITE=false]
  |  partitions=1
  |  hosts=1 per-host-mem=unavailable
  |
  08:EXCHANGE [UNPARTITIONED]
     limit: 10
     hosts=3 per-host-mem=unavailable
     tuple-ids=2 row-size=12B cardinality=10

F03:PLAN FRAGMENT [HASH(parquet_bigdatabench_dw_order_300g.buyer_id)]
  DATASTREAM SINK [FRAGMENT=F04, EXCHANGE=08, UNPARTITIONED]
  07:AGGREGATE [FINALIZE]
  |  output: sum:merge(parquet_bigdatabench_dw_item_300g.goods_amount)
  |  group by: parquet_bigdatabench_dw_order_300g.buyer_id
  |  limit: 10
  |  hosts=3 per-host-mem=9.21GB
  |  tuple-ids=2 row-size=12B cardinality=2247426048
  |
  06:EXCHANGE [HASH(parquet_bigdatabench_dw_order_300g.buyer_id)]
     hosts=3 per-host-mem=0B
     tuple-ids=2 row-size=12B cardinality=2247426048

F02:PLAN FRAGMENT [HASH(parquet_bigdatabench_dw_item_300g.order_id)]
  DATASTREAM SINK [FRAGMENT=F03, EXCHANGE=06, HASH(parquet_bigdatabench_dw_order_300g.buyer_id)]
  03:AGGREGATE
  |  output: sum(parquet_bigdatabench_dw_item_300g.goods_amount)
  |  group by: parquet_bigdatabench_dw_order_300g.buyer_id
  |  hosts=3 per-host-mem=27.63GB
  |  tuple-ids=2 row-size=12B cardinality=2247426048
  |
  02:HASH JOIN [INNER JOIN, PARTITIONED]
  |  hash predicates: parquet_bigdatabench_dw_item_300g.order_id = parquet_bigdatabench_dw_order_300g.order_id
  |  hosts=3 per-host-mem=11.47GB
  |  tuple-ids=1,0 row-size=20B cardinality=4103971316
  |
  |--05:EXCHANGE [HASH(parquet_bigdatabench_dw_order_300g.order_id)]
  |     hosts=3 per-host-mem=0B
  |     tuple-ids=0 row-size=8B cardinality=4200000000
  |
  04:EXCHANGE [HASH(parquet_bigdatabench_dw_item_300g.order_id)]
     hosts=3 per-host-mem=0B
     tuple-ids=1 row-size=12B cardinality=4200000000

F01:PLAN FRAGMENT [RANDOM]
  DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=05, HASH(parquet_bigdatabench_dw_order_300g.order_id)]
  00:SCAN HDFS [default.parquet_bigdatabench_dw_order_300g, RANDOM]
     partitions=1/1 files=87 size=21.15GB
     table stats: 4200000000 rows total
     column stats: all
     hosts=3 per-host-mem=176.00MB
     tuple-ids=0 row-size=8B cardinality=4200000000

F00:PLAN FRAGMENT [RANDOM]
  DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=04, HASH(parquet_bigdatabench_dw_item_300g.order_id)]
  01:SCAN HDFS [default.parquet_bigdatabench_dw_item_300g, RANDOM]
     partitions=1/1 files=258 size=63.82GB
     table stats: 4200000000 rows total
     column stats: all
     hosts=3 per-host-mem=176.00MB
     tuple-ids=1 row-size=12B cardinality=4200000000
----------------
    Estimated Per-Host Mem: 42170573209
    Estimated Per-Host VCores: 2
    Admission result: Admitted immediately
    Request Pool: root.root
    ExecSummary: 
Operator          #Hosts   Avg Time  Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                         
---------------------------------------------------------------------------------------------------------------------------
08:EXCHANGE            1      10h8m     10h8m      0          10          0        -1.00 B  UNPARTITIONED                  
07:AGGREGATE           3  135.686ms  407.60ms      0       2.25B  155.03 MB        9.21 GB  FINALIZE                       
06:EXCHANGE            3   15.975ms  24.767ms  1.20M       2.25B          0              0  HASH(parquet_bigdatabench_d... 
03:AGGREGATE           3  887.849ms   1s340ms  1.20M       2.25B  155.02 MB       27.63 GB                                 
02:HASH JOIN           3      3h19m     9h53m  1.50M       4.10B   31.46 GB       11.47 GB  INNER JOIN, PARTITIONED        
|--05:EXCHANGE         3       1m2s      2m5s  4.20B       4.20B          0              0  HASH(parquet_bigdatabench_d... 
|  00:SCAN HDFS        3   12s695ms  16s494ms  4.20B       4.20B  485.76 MB      176.00 MB  default.parquet_bigdatabenc... 
04:EXCHANGE            3   59s722ms     2m59s  4.20B       4.20B          0              0  HASH(parquet_bigdatabench_d... 
01:SCAN HDFS           3   14s341ms  19s831ms  4.20B       4.20B  205.20 MB      176.00 MB  default.parquet_bigdatabenc... 
1 ACCEPTED SOLUTION


It looks like you have extreme skew in the key that you're joining on (~2 billion duplicates). The error message is:

 

"Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 8. Number of rows 2052516353."

 

In order for the hash join to join on a key, all the values for that key on the right side of the join need to fit in memory. Impala has several ways to avoid this problem, but it looks like you defeated them all.
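Some rough arithmetic from your profile (assuming I'm reading the plan right and the build side is the order table, row-size=8B): a single spilled partition holding 2,052,516,353 rows works out to about

2,052,516,353 rows x 8 bytes/row ≈ 16.4 GB

of raw row data before any hash-table overhead, which is in the same ballpark as the 31.46 GB peak the profile shows for the join node once that overhead is added.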

 

First, it tries to put the smaller input on the right side of the join, but both your inputs are the same size, so that doesn't help.
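(For completeness: you can also control that yourself. With the STRAIGHT_JOIN hint Impala keeps the tables in exactly the order you wrote them, so whichever table you list second becomes the build side. Below is a sketch using your table names; it wouldn't change anything here, since both inputs have the same row count.)

-- straight_join keeps the written join order; the second table (o) is the build side
select straight_join
       o.buyer_id, sum(i.goods_amount) as total
from parquet_bigdatabench_dw_item_300g i
join [shuffle] parquet_bigdatabench_dw_order_300g o
  on i.order_id = o.order_id
group by o.buyer_id
limit 10;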

 

Second, it will try to spill some of the rows to disk and process a subset of those.

 

Third, it will try to repeatedly split ("repartition") the right-side input based on the join key to try to get it small enough to fit in memory. Based on the error message, it tried to do that 8 times but still had 2 billion rows in one of the partitions, which probably means there are 2 billion rows with the same key (rows with an identical key always hash to the same partition, so no amount of repartitioning can split them apart).


9 REPLIES


Explorer

Tim, thank you for the reply. Yes, I see that I really do have data skew, but how can I avoid it? By the way, how can I get the details of the partitioned join?


It looks like it's some kind of benchmark data? Is it a publicly available benchmark? It seems strange that it would have such skewed data.

 

I think even if it did stay in memory it would run for an incredibly long time: if 2 billion rows with the same key in one table match 2 billion rows with that key in the other table, you get on the order of 10^18 output rows (over a quintillion). So I think either there is something strange with the benchmark data or it doesn't make sense to join these tables.
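If you want to confirm the skew, a quick check (an untested sketch using the table from your query) is to look for the most duplicated join keys on the build side:

-- top 10 most frequent order_id values; a count in the billions confirms the skew
select order_id, count(*) as cnt
from parquet_bigdatabench_dw_order_300g
group by order_id
order by cnt desc
limit 10;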

 

We don't have the exact join algorithm documented anywhere other than in the Impala code. It's a version of the hybrid hash join: https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join


The algorithm is described in the Impala source code here, if you (or anyone else reading) are interested:

 

https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/partitioned-hash-join-node.h

Explorer

Tim, thank you very much. I think I should try another benchmark dataset.

Explorer
Can you give me some advice on benchmarking Impala, including the tools, datasets, etc.?


We often use TPC-H and TPC-DS; they're pretty standard for analytical databases. There's a TPC-DS kit for Impala here: https://github.com/cloudera/impala-tpcds-kit

New Contributor

@lizhenmxcz @Tim Armstrong How did you (the original poster) get that query summary? I would like to be able to analyze my own queries like that.


If you're using impala-shell, you can use the "summary;" command. Otherwise it's accessible through the Impala debug web pages (typically http://the-impala-server:25000).
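For example, in impala-shell, right after your query finishes you can type:

summary;
profile;

summary; prints the per-operator table like the one in the original post, and profile; dumps the full runtime profile (the same text you can also pull from the /queries page of the debug web UI).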