<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hive  Order By with Limits query performance optimazation in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138463#M56209</link>
    <description>&lt;P&gt;Hi  &lt;A rel="user" href="https://community.cloudera.com/users/1027/ekoifman.html" nodeid="1027"&gt;@Eugene Koifman&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Just Wondering   if  h&lt;A href="https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.mapred.mode"&gt;ive.mapred.mode&lt;/A&gt;=strict  , why hive not using distribute by  sort by  Limit  to replace the order by execution plan?&lt;/P&gt;&lt;P&gt;i have tested with my data , it seems those two query are identical on final result; &lt;/P&gt;&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10969/mqureshi.html" nodeid="10969"&gt;@mqureshi&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Thank you very much , in general distribute by sort by is not the same as order by , but if we had a limit after it , hive will put another reducer to make the final result.  Thank you very much &lt;/P&gt;</description>
    <pubDate>Wed, 22 Mar 2017 00:55:26 GMT</pubDate>
    <dc:creator>liuyanpunk</dc:creator>
    <dc:date>2017-03-22T00:55:26Z</dc:date>
    <item>
      <title>Hive  Order By with Limits query performance optimazation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138457#M56203</link>
      <description>&lt;P&gt;Hi    &lt;/P&gt;&lt;P&gt;   Is there any way to speed up the Hive Order by With Limit Query in any way?&lt;/P&gt;&lt;P&gt;  suppose the base table contains 1Km (1,000,000,000)  rows  and now perform the query &lt;/P&gt;&lt;P&gt;  Select * from Table t  order by t.order_num  Limit 10000; &lt;/P&gt;&lt;P&gt;  no table and query modification allowed since the real query are much complicated in the Select part and a pre-sorted/Orc Partitioned tables does not helps much.     since all rows goes to the same reducer to get  the correct order, it greatly impacted the performance of the Hive on Tez performance.   And we are in an POC to show off The power of Hive on Tez but got stuck in this order by query. &lt;/P&gt;&lt;P&gt;Thank you very much in advance&lt;/P&gt;</description>
      <pubDate>Mon, 06 Mar 2017 00:25:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138457#M56203</guid>
      <dc:creator>liuyanpunk</dc:creator>
      <dc:date>2017-03-06T00:25:26Z</dc:date>
    </item>
    <item>
      <title>Re: Hive  Order By with Limits query performance optimazation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138458#M56204</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10194/liuyanpunk.html" nodeid="10194"&gt;@Yan Liu&lt;/A&gt;&lt;/P&gt;&lt;P&gt;In the query you are running, set the following&lt;/P&gt;&lt;P&gt;h&lt;A href="https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.mapred.mode"&gt;ive.mapred.mode&lt;/A&gt;=nonstrict --&amp;gt;it should run much faster but your customer might not be happy with this hack.&lt;/P&gt;&lt;P&gt;Instead of using ORDER BY (order by column) use DISTRIBUTE BY (order by column), SORT BY (sort column).&lt;/P&gt;&lt;P&gt;this will create multiple reducers on distribute by (column name)  and the result fed to these multiple reducers will already be sorted by the SORT BY syntax. This, I think is the right way to do it and your customer will be happy.&lt;/P&gt;&lt;P&gt;&lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 06 Mar 2017 02:44:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138458#M56204</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2017-03-06T02:44:31Z</dc:date>
    </item>
    <item>
      <title>Re: Hive  Order By with Limits query performance optimazation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138459#M56205</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/10969/mqureshi.html" nodeid="10969"&gt;@mqureshi&lt;/A&gt; &lt;/P&gt;&lt;P&gt; Does DISTRIBUTE BY (order by column), SORT BY (sort column). always give the same result as using ORDER BY (order by column)?   Order by is a total order and distributed by -&amp;gt; sort by could give the total order as well ? because we have an LIMIT clause at the end. &lt;/P&gt;</description>
      <pubDate>Mon, 06 Mar 2017 10:06:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138459#M56205</guid>
      <dc:creator>liuyanpunk</dc:creator>
      <dc:date>2017-03-06T10:06:59Z</dc:date>
    </item>
    <item>
      <title>Re: Hive  Order By with Limits query performance optimazation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138460#M56206</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10194/liuyanpunk.html" nodeid="10194"&gt;@Yan Liu&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Depending on how you write the two queries, yes, absolutely, they should give you the same results. As for LIMIT clause, you can add LIMIT clause to your DISTRIBUTE BY SORT BY query at the end just like you would in the ORDER BY query.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Mar 2017 10:12:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138460#M56206</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2017-03-06T10:12:18Z</dc:date>
    </item>
    <item>
      <title>Re: Hive  Order By with Limits query performance optimazation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138461#M56207</link>
      <description>&lt;P&gt;Order by produces a total order for the result set.  Sort By sorts the output of each reducer so in general this will not produce the same answer&lt;/P&gt;</description>
      <pubDate>Tue, 07 Mar 2017 02:38:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138461#M56207</guid>
      <dc:creator>ekoifman</dc:creator>
      <dc:date>2017-03-07T02:38:21Z</dc:date>
    </item>
    <item>
      <title>Re: Hive  Order By with Limits query performance optimazation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138462#M56208</link>
      <description>&lt;P&gt;@mqureshi&lt;/P&gt;&lt;P&gt;Thanks for your answer.  turn off the nonstrict   does offer some speed to the query. &lt;/P&gt;&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1027/ekoifman.html" nodeid="1027"&gt;@Eugene Koifman&lt;/A&gt; &lt;/P&gt;&lt;P&gt;The most interesting thing i found was setting the hive.map.aggr = false . it reduced the query speed from 15mins down to 3 mins . &lt;/P&gt;&lt;P&gt;  &lt;/P&gt;</description>
      <pubDate>Tue, 07 Mar 2017 03:04:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138462#M56208</guid>
      <dc:creator>liuyanpunk</dc:creator>
      <dc:date>2017-03-07T03:04:05Z</dc:date>
    </item>
    <item>
      <title>Re: Hive  Order By with Limits query performance optimazation</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138463#M56209</link>
      <description>&lt;P&gt;Hi  &lt;A rel="user" href="https://community.cloudera.com/users/1027/ekoifman.html" nodeid="1027"&gt;@Eugene Koifman&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Just Wondering   if  h&lt;A href="https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.mapred.mode"&gt;ive.mapred.mode&lt;/A&gt;=strict  , why hive not using distribute by  sort by  Limit  to replace the order by execution plan?&lt;/P&gt;&lt;P&gt;i have tested with my data , it seems those two query are identical on final result; &lt;/P&gt;&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10969/mqureshi.html" nodeid="10969"&gt;@mqureshi&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Thank you very much , in general distribute by sort by is not the same as order by , but if we had a limit after it , hive will put another reducer to make the final result.  Thank you very much &lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2017 00:55:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Order-By-with-Limits-query-performance-optimazation/m-p/138463#M56209</guid>
      <dc:creator>liuyanpunk</dc:creator>
      <dc:date>2017-03-22T00:55:26Z</dc:date>
    </item>
  </channel>
</rss>

