Support Questions

Tim Armstrong · ‎04-27-2016

It looks like you have extreme skew in the key that you're joining on (~2 billion duplicates). The error message is:

"Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 8. Number of rows 2052516353."

In order for the hash join to join on a key all the values for that key on the right side of the join need to be able to fit in memory. Impala has several ways to avoid this problem, but it looks like you defeated them all.

First, it tries to put the smaller input on the right side of the join, but both your inputs are the same size, so that doesn't help.

Second, it will try to spill some of the rows to disk and process a subset of those.

Third, it will try to repeatedly split ("repartition") the right-side input based on the join key to try and get it small enough to fit in memory. Based on the error message, it tried to do that 8 times but still has 2 billion rows in one of the partitions, which probably means there are 2 billion rows with the same key.

View solution in original post

Cloudera Community

Support Questions

Who agreed with this solution