Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

aggregate function with join gives wrong value in hive

Solved Go to solution

aggregate function with join gives wrong value in hive

Expert Contributor

This is what I have so far. The value that I get from SUM is much higher with join than what I get without join.I can't figure out why is that?

  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  LEFT JOIN
  erp.fabric_outgrn outgrn 
  ON
  UPPER(ingrn.fab_id) = outgrn.out_id
  GROUP BY UPPER(ingrn.fab_id) 

It gives a different value from:

  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  GROUP BY UPPER(ingrn.fab_id)
1 ACCEPTED SOLUTION

Accepted Solutions

Re: aggregate function with join gives wrong value in hive

Rising Star

Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):

hive> CREATE TABLE table_a (
    >   id STRING,
    >   quantity INT
    > );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);

table_b has duplicate id's: note that id=1 appears twice:

hive> CREATE TABLE table_b (
    >   id STRING
    > );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);

If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a;
60

If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

hive> SELECT
    >   *
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
1  30  1
1  30  1
2  20  2
3  10  3

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
90

I suspect that's what's happening with your query.

4 REPLIES 4

Re: aggregate function with join gives wrong value in hive

Rising Star

Are there duplicates in the outgrn.out_id column?

Re: aggregate function with join gives wrong value in hive

Expert Contributor

@alex: Yes, but how does that change anything? I am still grouping by ingrn.fab_id and summing ingrn.passed_qty

Re: aggregate function with join gives wrong value in hive

Rising Star

The join happens before the aggregation. You're aggregating the result of the join, which has inflated the row count because there are duplicates.

Re: aggregate function with join gives wrong value in hive

Rising Star

Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):

hive> CREATE TABLE table_a (
    >   id STRING,
    >   quantity INT
    > );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);

table_b has duplicate id's: note that id=1 appears twice:

hive> CREATE TABLE table_b (
    >   id STRING
    > );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);

If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a;
60

If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

hive> SELECT
    >   *
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
1  30  1
1  30  1
2  20  2
3  10  3

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
90

I suspect that's what's happening with your query.