<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question aggregate function with join  gives wrong value in hive in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152203#M114670</link>
    <description>&lt;P&gt;This is what I have so far. The value that I get from SUM is much higher with join than what I get without join.I can't figure out why is that?&lt;/P&gt;&lt;PRE&gt;  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  LEFT JOIN
  erp.fabric_outgrn outgrn 
  ON
  UPPER(ingrn.fab_id) = outgrn.out_id
  GROUP BY UPPER(ingrn.fab_id) &lt;/PRE&gt;&lt;P&gt;It gives a different value from:&lt;/P&gt;&lt;PRE&gt;  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  GROUP BY UPPER(ingrn.fab_id)&lt;/PRE&gt;</description>
    <pubDate>Wed, 14 Dec 2016 12:49:20 GMT</pubDate>
    <dc:creator>simran_k</dc:creator>
    <dc:date>2016-12-14T12:49:20Z</dc:date>
    <item>
      <title>aggregate function with join  gives wrong value in hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152203#M114670</link>
      <description>&lt;P&gt;This is what I have so far. The value that I get from SUM is much higher with join than what I get without join.I can't figure out why is that?&lt;/P&gt;&lt;PRE&gt;  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  LEFT JOIN
  erp.fabric_outgrn outgrn 
  ON
  UPPER(ingrn.fab_id) = outgrn.out_id
  GROUP BY UPPER(ingrn.fab_id) &lt;/PRE&gt;&lt;P&gt;It gives a different value from:&lt;/P&gt;&lt;PRE&gt;  SELECT 
  SUM(ingrn.passed_qty)
  FROM
  erp.fabric_grn ingrn
  GROUP BY UPPER(ingrn.fab_id)&lt;/PRE&gt;</description>
      <pubDate>Wed, 14 Dec 2016 12:49:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152203#M114670</guid>
      <dc:creator>simran_k</dc:creator>
      <dc:date>2016-12-14T12:49:20Z</dc:date>
    </item>
    <item>
      <title>Re: aggregate function with join  gives wrong value in hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152204#M114671</link>
      <description>&lt;P&gt;Are there duplicates in the outgrn.out_id column?&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 13:15:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152204#M114671</guid>
      <dc:creator>awoolford</dc:creator>
      <dc:date>2016-12-14T13:15:19Z</dc:date>
    </item>
    <item>
      <title>Re: aggregate function with join  gives wrong value in hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152205#M114672</link>
      <description>&lt;P&gt;@alex: Yes, but how does that change anything? I am still grouping by ingrn.fab_id and summing ingrn.passed_qty&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 13:21:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152205#M114672</guid>
      <dc:creator>simran_k</dc:creator>
      <dc:date>2016-12-14T13:21:54Z</dc:date>
    </item>
    <item>
      <title>Re: aggregate function with join  gives wrong value in hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152206#M114673</link>
      <description>&lt;P&gt;The join happens before the aggregation. You're aggregating the result of the join, which has inflated the row count because there are duplicates.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 14:23:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152206#M114673</guid>
      <dc:creator>awoolford</dc:creator>
      <dc:date>2016-12-14T14:23:21Z</dc:date>
    </item>
    <item>
      <title>Re: aggregate function with join  gives wrong value in hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152207#M114674</link>
      <description>&lt;P&gt;Let's create two Hive tables: table_a and table_b. table_a contains the column you want to aggregate, and has only one record per id (i.e. the key):&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; CREATE TABLE table_a (
    &amp;gt;   id STRING,
    &amp;gt;   quantity INT
    &amp;gt; );
hive&amp;gt; INSERT INTO table_a VALUES (1, 30);
hive&amp;gt; INSERT INTO table_a VALUES (2, 20);
hive&amp;gt; INSERT INTO table_a VALUES (3, 10);&lt;/PRE&gt;&lt;P&gt;table_b has duplicate id's: note that id=1 appears twice:&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; CREATE TABLE table_b (
    &amp;gt;   id STRING
    &amp;gt; );
hive&amp;gt; INSERT INTO table_b VALUES (1);
hive&amp;gt; INSERT INTO table_b VALUES (1);
hive&amp;gt; INSERT INTO table_b VALUES (2);
hive&amp;gt; INSERT INTO table_b VALUES (3);
&lt;/PRE&gt;&lt;P&gt;If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; SELECT
    &amp;gt;   SUM(quantity)
    &amp;gt; FROM table_a;
60&lt;/PRE&gt;&lt;P&gt;If we join the table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; SELECT
    &amp;gt;   *
    &amp;gt; FROM table_a
    &amp;gt; LEFT JOIN table_b
    &amp;gt; ON table_a.id = table_b.id;
1  30  1
1  30  1
2  20  2
3  10  3&lt;/PRE&gt;&lt;P&gt;Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; SELECT
    &amp;gt;   SUM(quantity)
    &amp;gt; FROM table_a
    &amp;gt; LEFT JOIN table_b
    &amp;gt; ON table_a.id = table_b.id;
90&lt;/PRE&gt;&lt;P&gt;I suspect that's what's happening with your query.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Dec 2016 15:00:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/aggregate-function-with-join-gives-wrong-value-in-hive/m-p/152207#M114674</guid>
      <dc:creator>awoolford</dc:creator>
      <dc:date>2016-12-14T15:00:12Z</dc:date>
    </item>
  </channel>
</rss>

