Support Questions

Find answers, ask questions, and share your expertise

Hive Order By with Limits query performance optimazation

avatar
Contributor

Hi

Is there any way to speed up the Hive Order by With Limit Query in any way?

suppose the base table contains 1Km (1,000,000,000) rows and now perform the query

Select * from Table t order by t.order_num Limit 10000;

no table and query modification allowed since the real query are much complicated in the Select part and a pre-sorted/Orc Partitioned tables does not helps much. since all rows goes to the same reducer to get the correct order, it greatly impacted the performance of the Hive on Tez performance. And we are in an POC to show off The power of Hive on Tez but got stuck in this order by query.

Thank you very much in advance

1 ACCEPTED SOLUTION

avatar
Super Guru

@Yan Liu

In the query you are running, set the following

hive.mapred.mode=nonstrict -->it should run much faster but your customer might not be happy with this hack.

Instead of using ORDER BY (order by column) use DISTRIBUTE BY (order by column), SORT BY (sort column).

this will create multiple reducers on distribute by (column name) and the result fed to these multiple reducers will already be sorted by the SORT BY syntax. This, I think is the right way to do it and your customer will be happy.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

View solution in original post

6 REPLIES 6

avatar
Super Guru

@Yan Liu

In the query you are running, set the following

hive.mapred.mode=nonstrict -->it should run much faster but your customer might not be happy with this hack.

Instead of using ORDER BY (order by column) use DISTRIBUTE BY (order by column), SORT BY (sort column).

this will create multiple reducers on distribute by (column name) and the result fed to these multiple reducers will already be sorted by the SORT BY syntax. This, I think is the right way to do it and your customer will be happy.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

avatar
Contributor

Thanks @mqureshi

Does DISTRIBUTE BY (order by column), SORT BY (sort column). always give the same result as using ORDER BY (order by column)? Order by is a total order and distributed by -> sort by could give the total order as well ? because we have an LIMIT clause at the end.

avatar
Super Guru

@Yan Liu

Depending on how you write the two queries, yes, absolutely, they should give you the same results. As for LIMIT clause, you can add LIMIT clause to your DISTRIBUTE BY SORT BY query at the end just like you would in the ORDER BY query.

avatar
Super Collaborator

Order by produces a total order for the result set. Sort By sorts the output of each reducer so in general this will not produce the same answer

avatar
Contributor

Hi @Eugene Koifman

Just Wondering if hive.mapred.mode=strict , why hive not using distribute by sort by Limit to replace the order by execution plan?

i have tested with my data , it seems those two query are identical on final result;

@mqureshi

Thank you very much , in general distribute by sort by is not the same as order by , but if we had a limit after it , hive will put another reducer to make the final result. Thank you very much

avatar
Contributor

@mqureshi

Thanks for your answer. turn off the nonstrict does offer some speed to the query.

@Eugene Koifman

The most interesting thing i found was setting the hive.map.aggr = false . it reduced the query speed from 15mins down to 3 mins .