# Support Questions

## aggregate function with join gives wrong value in hive simran_k
Expert Contributor

This is what I have so far. The value that I get from SUM is much higher with join than what I get without join.I can't figure out why is that?

```  SELECT
SUM(ingrn.passed_qty)
FROM
erp.fabric_grn ingrn
LEFT JOIN
erp.fabric_outgrn outgrn
ON
UPPER(ingrn.fab_id) = outgrn.out_id
GROUP BY UPPER(ingrn.fab_id) ```

It gives a different value from:

```  SELECT
SUM(ingrn.passed_qty)
FROM
erp.fabric_grn ingrn
GROUP BY UPPER(ingrn.fab_id)```
1 ACCEPTED SOLUTION awoolford
Rising Star

Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):

```hive> CREATE TABLE table_a (
>   id STRING,
>   quantity INT
> );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);```

table_b has duplicate id's: note that id=1 appears twice:

```hive> CREATE TABLE table_b (
>   id STRING
> );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);
```

If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

```hive> SELECT
>   SUM(quantity)
> FROM table_a;
60```

If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

```hive> SELECT
>   *
> FROM table_a
> LEFT JOIN table_b
> ON table_a.id = table_b.id;
1  30  1
1  30  1
2  20  2
3  10  3```

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

```hive> SELECT
>   SUM(quantity)
> FROM table_a
> LEFT JOIN table_b
> ON table_a.id = table_b.id;
90```

I suspect that's what's happening with your query.

4 REPLIES 4 awoolford
Rising Star

Are there duplicates in the outgrn.out_id column? simran_k
Expert Contributor

@alex: Yes, but how does that change anything? I am still grouping by ingrn.fab_id and summing ingrn.passed_qty awoolford
Rising Star

The join happens before the aggregation. You're aggregating the result of the join, which has inflated the row count because there are duplicates. awoolford
Rising Star

Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):

```hive> CREATE TABLE table_a (
>   id STRING,
>   quantity INT
> );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);```

table_b has duplicate id's: note that id=1 appears twice:

```hive> CREATE TABLE table_b (
>   id STRING
> );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);
```

If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

```hive> SELECT
>   SUM(quantity)
> FROM table_a;
60```

If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

```hive> SELECT
>   *
> FROM table_a
> LEFT JOIN table_b
> ON table_a.id = table_b.id;
1  30  1
1  30  1
2  20  2
3  10  3```

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

```hive> SELECT
>   SUM(quantity)
> FROM table_a
> LEFT JOIN table_b
> ON table_a.id = table_b.id;
90```

I suspect that's what's happening with your query. Take a Tour of the Community
Community Browser
Don't have an account?