[HIVE] Do presorted tables imply the best join performance?


Many articles about Hive query optimization advise presorting tables in order to optimize joins.

I have also read, in other articles, that the sort algorithm used during the shuffle is QuickSort or a variant of it...

So I am a little confused: is QuickSort at its fastest when its input is an array made up of 2 sorted arrays?




When you're joining two tables and all the data in both tables is sorted by the join column, Hive can use this to simplify the join algorithm (it will use a merge join). Which join is actually fastest depends on other factors as well, though (the sizes of the tables, etc.).
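
To illustrate why sorted inputs help, here is a minimal Python sketch of a merge join over two inputs that are already sorted by the join key. It is only an illustration of the general technique, not Hive's implementation; the function name `merge_join` and the sample rows are hypothetical.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are ALREADY sorted by key.

    Each input is scanned once from start to end; no sorting or hashing is needed.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key too small: advance the left side
        elif lk > rk:
            j += 1          # right key too small: advance the right side
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


# Hypothetical sample data, sorted by the join key (first element).
left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right_rows = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
joined = merge_join(left_rows, right_rows)
```

Because both sides are consumed in key order, the join only has to advance two cursors. If the inputs were not sorted, they would have to be sorted first (or joined another way), which is the extra work that presorting avoids.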

Thanks @Gunther Hagleitner for your answer.

But how does Hive know that the data in both tables is sorted?

When I join 2 tables and neither of them is sorted, the explain command tells me that Hive will perform a "Merge Join" too...

I know that an SMB (Sort Merge Bucket) join will improve a join, but it requires bucketed tables...
