Created 03-05-2017 04:25 PM
Hi
Is there any way to speed up the Hive Order by With Limit Query in any way?
suppose the base table contains 1Km (1,000,000,000) rows and now perform the query
Select * from Table t order by t.order_num Limit 10000;
no table and query modification allowed since the real query are much complicated in the Select part and a pre-sorted/Orc Partitioned tables does not helps much. since all rows goes to the same reducer to get the correct order, it greatly impacted the performance of the Hive on Tez performance. And we are in an POC to show off The power of Hive on Tez but got stuck in this order by query.
Thank you very much in advance
Created 03-05-2017 06:44 PM
In the query you are running, set the following
hive.mapred.mode=nonstrict -->it should run much faster but your customer might not be happy with this hack.
Instead of using ORDER BY (order by column) use DISTRIBUTE BY (order by column), SORT BY (sort column).
this will create multiple reducers on distribute by (column name) and the result fed to these multiple reducers will already be sorted by the SORT BY syntax. This, I think is the right way to do it and your customer will be happy.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
Created 03-05-2017 06:44 PM
In the query you are running, set the following
hive.mapred.mode=nonstrict -->it should run much faster but your customer might not be happy with this hack.
Instead of using ORDER BY (order by column) use DISTRIBUTE BY (order by column), SORT BY (sort column).
this will create multiple reducers on distribute by (column name) and the result fed to these multiple reducers will already be sorted by the SORT BY syntax. This, I think is the right way to do it and your customer will be happy.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
Created 03-06-2017 02:06 AM
Thanks @mqureshi
Does DISTRIBUTE BY (order by column), SORT BY (sort column). always give the same result as using ORDER BY (order by column)? Order by is a total order and distributed by -> sort by could give the total order as well ? because we have an LIMIT clause at the end.
Created 03-06-2017 02:12 AM
Depending on how you write the two queries, yes, absolutely, they should give you the same results. As for LIMIT clause, you can add LIMIT clause to your DISTRIBUTE BY SORT BY query at the end just like you would in the ORDER BY query.
Created 03-06-2017 06:38 PM
Order by produces a total order for the result set. Sort By sorts the output of each reducer so in general this will not produce the same answer
Created 03-21-2017 05:55 PM
Just Wondering if hive.mapred.mode=strict , why hive not using distribute by sort by Limit to replace the order by execution plan?
i have tested with my data , it seems those two query are identical on final result;
Thank you very much , in general distribute by sort by is not the same as order by , but if we had a limit after it , hive will put another reducer to make the final result. Thank you very much
Created 03-06-2017 07:04 PM
@mqureshi
Thanks for your answer. turn off the nonstrict does offer some speed to the query.
The most interesting thing i found was setting the hive.map.aggr = false . it reduced the query speed from 15mins down to 3 mins .