Support Questions

simran_k · ‎12-14-2016

This is what I have so far. The value that I get from SUM is much higher with join than what I get without join.I can't figure out why is that?

  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  LEFT JOIN
  erp.fabric_outgrn outgrn 
  ON
  UPPER(ingrn.fab_id) = outgrn.out_id
  GROUP BY UPPER(ingrn.fab_id)

It gives a different value from:

  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  GROUP BY UPPER(ingrn.fab_id)

awoolford · ‎12-14-2016

Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):

hive> CREATE TABLE table_a (
    >   id STRING,
    >   quantity INT
    > );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);

table_b has duplicate id's: note that id=1 appears twice:

hive> CREATE TABLE table_b (
    >   id STRING
    > );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);

If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a;
60

If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

hive> SELECT
    >   *
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
1  30  1
1  30  1
2  20  2
3  10  3

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
90

I suspect that's what's happening with your query.

View solution in original post

awoolford · ‎12-14-2016

Are there duplicates in the outgrn.out_id column?

simran_k · ‎12-14-2016

@alex: Yes, but how does that change anything? I am still grouping by ingrn.fab_id and summing ingrn.passed_qty

awoolford · ‎12-14-2016

The join happens before the aggregation. You're aggregating the result of the join, which has inflated the row count because there are duplicates.

awoolford · ‎12-14-2016