Support Questions

Find answers, ask questions, and share your expertise

aggregate function with join gives wrong value in hive

avatar
Expert Contributor

This is what I have so far. The value that I get from SUM is much higher with join than what I get without join.I can't figure out why is that?

  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  LEFT JOIN
  erp.fabric_outgrn outgrn 
  ON
  UPPER(ingrn.fab_id) = outgrn.out_id
  GROUP BY UPPER(ingrn.fab_id) 

It gives a different value from:

  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  GROUP BY UPPER(ingrn.fab_id)
1 ACCEPTED SOLUTION

avatar
Expert Contributor

Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):

hive> CREATE TABLE table_a (
    >   id STRING,
    >   quantity INT
    > );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);

table_b has duplicate id's: note that id=1 appears twice:

hive> CREATE TABLE table_b (
    >   id STRING
    > );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);

If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a;
60

If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

hive> SELECT
    >   *
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
1  30  1
1  30  1
2  20  2
3  10  3

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
90

I suspect that's what's happening with your query.

View solution in original post

4 REPLIES 4

avatar
Expert Contributor

Are there duplicates in the outgrn.out_id column?

avatar
Expert Contributor

@alex: Yes, but how does that change anything? I am still grouping by ingrn.fab_id and summing ingrn.passed_qty

avatar
Expert Contributor

The join happens before the aggregation. You're aggregating the result of the join, which has inflated the row count because there are duplicates.

avatar
Expert Contributor

Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):

hive> CREATE TABLE table_a (
    >   id STRING,
    >   quantity INT
    > );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);

table_b has duplicate id's: note that id=1 appears twice:

hive> CREATE TABLE table_b (
    >   id STRING
    > );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);

If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a;
60

If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

hive> SELECT
    >   *
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
1  30  1
1  30  1
2  20  2
3  10  3

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
90

I suspect that's what's happening with your query.