Created 12-14-2016 04:49 AM
This is what I have so far. The value that I get from SUM is much higher with join than what I get without join.I can't figure out why is that?
SELECT SUM(ingrn.passed_qty) FROM erp.fabric_grn ingrn LEFT JOIN erp.fabric_outgrn outgrn ON UPPER(ingrn.fab_id) = outgrn.out_id GROUP BY UPPER(ingrn.fab_id)
It gives a different value from:
SELECT SUM(ingrn.passed_qty) FROM erp.fabric_grn ingrn GROUP BY UPPER(ingrn.fab_id)
Created 12-14-2016 07:00 AM
Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):
hive> CREATE TABLE table_a ( > id STRING, > quantity INT > ); hive> INSERT INTO table_a VALUES (1, 30); hive> INSERT INTO table_a VALUES (2, 20); hive> INSERT INTO table_a VALUES (3, 10);
table_b has duplicate id's: note that id=1 appears twice:
hive> CREATE TABLE table_b ( > id STRING > ); hive> INSERT INTO table_b VALUES (1); hive> INSERT INTO table_b VALUES (1); hive> INSERT INTO table_b VALUES (2); hive> INSERT INTO table_b VALUES (3);
If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:
hive> SELECT > SUM(quantity) > FROM table_a; 60
If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:
hive> SELECT > * > FROM table_a > LEFT JOIN table_b > ON table_a.id = table_b.id; 1 30 1 1 30 1 2 20 2 3 10 3
Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:
hive> SELECT > SUM(quantity) > FROM table_a > LEFT JOIN table_b > ON table_a.id = table_b.id; 90
I suspect that's what's happening with your query.
Created 12-14-2016 05:15 AM
Are there duplicates in the outgrn.out_id column?
Created 12-14-2016 05:21 AM
@alex: Yes, but how does that change anything? I am still grouping by ingrn.fab_id and summing ingrn.passed_qty
Created 12-14-2016 06:23 AM
The join happens before the aggregation. You're aggregating the result of the join, which has inflated the row count because there are duplicates.
Created 12-14-2016 07:00 AM
Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):
hive> CREATE TABLE table_a ( > id STRING, > quantity INT > ); hive> INSERT INTO table_a VALUES (1, 30); hive> INSERT INTO table_a VALUES (2, 20); hive> INSERT INTO table_a VALUES (3, 10);
table_b has duplicate id's: note that id=1 appears twice:
hive> CREATE TABLE table_b ( > id STRING > ); hive> INSERT INTO table_b VALUES (1); hive> INSERT INTO table_b VALUES (1); hive> INSERT INTO table_b VALUES (2); hive> INSERT INTO table_b VALUES (3);
If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:
hive> SELECT > SUM(quantity) > FROM table_a; 60
If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:
hive> SELECT > * > FROM table_a > LEFT JOIN table_b > ON table_a.id = table_b.id; 1 30 1 1 30 1 2 20 2 3 10 3
Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:
hive> SELECT > SUM(quantity) > FROM table_a > LEFT JOIN table_b > ON table_a.id = table_b.id; 90
I suspect that's what's happening with your query.