<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Optimize a long running hive query - has a join with same table in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99035#M62109</link>
    <description>&lt;P&gt;The table is in ORC and haven't tried SMB, but the process is getting stuck at the last reducer and the whole processing runs in a single node while other nodes are idle.&lt;/P&gt;&lt;P&gt;Can you explain the function of  "hive.map.aggr" parameter?&lt;/P&gt;</description>
    <pubDate>Sun, 27 Dec 2015 04:28:33 GMT</pubDate>
    <dc:creator>sooraj_antony</dc:creator>
    <dc:date>2015-12-27T04:28:33Z</dc:date>
    <item>
      <title>Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99028#M62102</link>
      <description>&lt;P&gt;The below query takes a lot of time to execute. It is run with tez execution engine.&lt;/P&gt;&lt;PRE&gt;SELECT STG.EMP_TYPE,DEPT,COUNT(DISTINCT EMP_ID) AS COUNT, A.TOTAL_COUNT
FROM STAGE_SOURCE STG 
LEFT OUTER JOIN 
(SELECT EMP_TYPE,COUNT(DISTINCT EMP_ID) AS TOTAL_COUNT FROM STAGE_SOURCE GROUP BY EMP_TYPE) A
ON STG.EMP_TYPE = A.EMP_TYPE
GROUP BY STG.EMP_TYPE,DEPT,A.TOTAL_COUNT;&lt;/PRE&gt;&lt;P&gt;Is there any rewrite option or optimization strategy which can improve the query performance?&lt;/P&gt;&lt;P&gt;The subquery with alias "A" itself takes 2-3hrs to execute.&lt;/P&gt;&lt;P&gt;Attaching the explain plan of just the join subquery "A"&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/896-explain-plan.txt"&gt;explain-plan.txt&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Mon, 14 Dec 2015 04:23:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99028#M62102</guid>
      <dc:creator>sooraj_antony</dc:creator>
      <dc:date>2015-12-14T04:23:11Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99029#M62103</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1213/soorajantony.html" nodeid="1213"&gt;@Sooraj Antony&lt;/A&gt; Please attach explain plan. You can copy the plan in text and upload in pdf format. &lt;/P&gt;</description>
      <pubDate>Mon, 14 Dec 2015 08:22:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99029#M62103</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-14T08:22:44Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99030#M62104</link>
      <description>&lt;P&gt;The COUNT(DISTINCT) could be the bottleneck if it is not being parallelized. Can you share the explain plan ? &lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2015 02:35:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99030#M62104</guid>
      <dc:creator>jpp</dc:creator>
      <dc:date>2015-12-17T02:35:46Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99031#M62105</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/140/nsabharwal.html" nodeid="140"&gt;@Neeraj Sabharwal&lt;/A&gt; &lt;A rel="user" href="https://community.cloudera.com/users/152/jp.html" nodeid="152"&gt;@Jean-Philippe Player &lt;/A&gt;Explain Plan file is attached.. &lt;/P&gt;</description>
      <pubDate>Fri, 18 Dec 2015 03:33:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99031#M62105</guid>
      <dc:creator>sooraj_antony</dc:creator>
      <dc:date>2015-12-18T03:33:26Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99032#M62106</link>
      <description>&lt;P&gt;some obvious optimizations would be to convert the tables to ORC, you're using text, the tables are pretty large and ORC can help by skipping batches of rows that don't match the criteria. Another thing to look at is table ordering, here's more info &lt;A href="https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior&lt;/A&gt; and &lt;A href="https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=34015666" target="_blank"&gt;https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=34015666&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 24 Dec 2015 11:53:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99032#M62106</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2015-12-24T11:53:31Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99033#M62107</link>
      <description>&lt;P&gt;please check, if set hive.map.aggr=true or not. &lt;/P&gt;&lt;P&gt;Please change the table to ORC format and try to run only subquery. I hope, results should come fast.&lt;/P&gt;&lt;P&gt;Please check, if table can be bucketed on emp_type or not. try to implement SMB join.&lt;/P&gt;</description>
      <pubDate>Sat, 26 Dec 2015 21:39:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99033#M62107</guid>
      <dc:creator>sachin_mca25</dc:creator>
      <dc:date>2015-12-26T21:39:08Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99034#M62108</link>
      <description>&lt;P&gt;Table &lt;/P&gt;&lt;PRE&gt;STAGE_SOURCE&lt;/PRE&gt;&lt;P&gt; is already in ORC format.&lt;/P&gt;</description>
      <pubDate>Sun, 27 Dec 2015 04:25:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99034#M62108</guid>
      <dc:creator>sooraj_antony</dc:creator>
      <dc:date>2015-12-27T04:25:08Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99035#M62109</link>
      <description>&lt;P&gt;The table is in ORC and haven't tried SMB, but the process is getting stuck at the last reducer and the whole processing runs in a single node while other nodes are idle.&lt;/P&gt;&lt;P&gt;Can you explain the function of  "hive.map.aggr" parameter?&lt;/P&gt;</description>
      <pubDate>Sun, 27 Dec 2015 04:28:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99035#M62109</guid>
      <dc:creator>sooraj_antony</dc:creator>
      <dc:date>2015-12-27T04:28:33Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99036#M62110</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1213/soorajantony.html" nodeid="1213"&gt;@Sooraj Antony&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I believe the problem is most of records has same value for EMP_TYPE (aka skewed records), that would cause all records from same EMP_TYPE value to be sent to same reducer and would cause the last reducer to take a long time to finish.&lt;/P&gt;&lt;P&gt;Assuming you have small number of different values of EMP_TYPE (and the result fit in memory), try solution below:&lt;/P&gt;&lt;PRE&gt;set hive.ignore.mapjoin.hint=false;

SELECT /*+ MAPJOIN(A) */ STG.EMP_TYPE,DEPT,COUNT(DISTINCT EMP_ID) AS COUNT, A.TOTAL_COUNT
FROM STAGE_SOURCE STG 
LEFT OUTER JOIN 
(SELECT EMP_TYPE,COUNT(DISTINCT EMP_ID) AS TOTAL_COUNT FROM STAGE_SOURCE GROUP BY EMP_TYPE) A
ON STG.EMP_TYPE = A.EMP_TYPE
GROUP BY STG.EMP_TYPE,DEPT,A.TOTAL_COUNT;
&lt;/PRE&gt;</description>
      <pubDate>Tue, 29 Dec 2015 03:54:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99036#M62110</guid>
      <dc:creator>gbraccialli3</dc:creator>
      <dc:date>2015-12-29T03:54:17Z</dc:date>
    </item>
    <item>
      <title>Re: Optimize a long running hive query - has a join with same table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99037#M62111</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1213/soorajantony.html" nodeid="1213"&gt;@Sooraj Antony&lt;/A&gt; has this been resolved? Can you post your solution or accept the best answer?&lt;/P&gt;</description>
      <pubDate>Wed, 03 Feb 2016 10:41:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Optimize-a-long-running-hive-query-has-a-join-with-same/m-p/99037#M62111</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-03T10:41:53Z</dc:date>
    </item>
  </channel>
</rss>

