[HIVE] Do presorted tables give the best join performance?

New Contributor

Hi,

Many articles on Hive query optimization advise presorting tables in order to optimize joins.

I have also read, in other articles, that the sort algorithm used during the shuffle is quicksort or a variant of it.

So I am a little confused: is quicksort faster when its input is an array made up of two sorted arrays?

Thanks.

2 REPLIES

Re: [HIVE] Do presorted tables give the best join performance?

New Contributor

When you join two tables and the data in both is sorted by the join column, Hive can take advantage of this to simplify the join algorithm (it will use a merge join). Which join strategy is actually fastest also depends on other factors, though, such as the sizes of the tables.
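
To make the merge join idea concrete, here is a minimal Python sketch of joining two inputs that are already sorted by the join key. It is only an illustration of the general technique, not Hive's actual implementation; the function name `merge_join` and the sample rows are made up for the example.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) rows, both already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key is too small, advance left
        elif lk > rk:
            j += 1          # right key is too small, advance right
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


# Hypothetical sample data, sorted by the join key (first field).
orders = [(1, "book"), (2, "pen"), (2, "ink"), (4, "lamp")]
users = [(1, "ann"), (2, "bob"), (3, "cy")]
joined = merge_join(orders, users)
```

Because both inputs are sorted, each side is scanned once from start to end and no sort or hash table is needed at join time, which is where the saving comes from.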

Re: [HIVE] Do presorted tables give the best join performance?

New Contributor

Thanks @Gunther Hagleitner for your answer.

But how does Hive know that the data in both tables is sorted?

When I join two tables that are both unsorted, the EXPLAIN command tells me that Hive will perform a "Merge Join" as well...

I know that an SMB (Sort Merge Bucket) join will speed up a join, but it requires bucketed tables...
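
As a rough illustration of why a sort-merge-bucket join depends on bucketing, the Python sketch below buckets two inputs by the join key, sorts each bucket, and then merge-joins bucket `i` of one side only with bucket `i` of the other. This is a toy model under stated assumptions (integer keys, `key % num_buckets` as the bucketing function, the same bucket count on both sides); all names are hypothetical and it does not reflect Hive's internals.

```python
from itertools import groupby
from operator import itemgetter

NUM_BUCKETS = 4  # hypothetical bucket count, identical for both tables


def bucket_and_sort(rows, num_buckets=NUM_BUCKETS):
    """Place (key, value) rows into buckets by key, then sort each bucket by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))  # integer keys assumed
    return [sorted(b, key=itemgetter(0)) for b in buckets]


def merge_join_sorted(left, right):
    """Inner-join two key-sorted lists of (key, value) rows."""
    lgroups = ((k, [v for _, v in g]) for k, g in groupby(left, key=itemgetter(0)))
    rgroups = ((k, [v for _, v in g]) for k, g in groupby(right, key=itemgetter(0)))
    out = []
    l, r = next(lgroups, None), next(rgroups, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(lgroups, None)
        elif l[0] > r[0]:
            r = next(rgroups, None)
        else:
            out.extend((l[0], lv, rv) for lv in l[1] for rv in r[1])
            l, r = next(lgroups, None), next(rgroups, None)
    return out


def smb_join(left_buckets, right_buckets):
    """Join matching buckets pairwise; equal keys always share a bucket index."""
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(merge_join_sorted(lb, rb))
    return out


# Hypothetical sample data.
sales = bucket_and_sort([(7, "tv"), (2, "pen"), (6, "cup"), (2, "ink")])
customers = bucket_and_sort([(6, "dee"), (2, "bob"), (5, "eve")])
result = smb_join(sales, customers)
```

The point of the sketch is that the sorting and bucketing happen once, when the data is written, so the join itself only has to merge matching buckets.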