Many articles on Hive query optimization advise pre-sorting tables in order to optimize joins.
Other articles say that the sort algorithm used during the shuffle is QuickSort or a variant of it...
So I am a little confused: is QuickSort faster when its input is an array made up of two already-sorted arrays?
When you join two tables and the data in both is sorted by the join column, Hive can use this to simplify the join algorithm (it will use a merge join). Which join strategy is actually fastest also depends on other factors, though (the sizes of the tables, etc.).
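To illustrate why sorted inputs simplify the join, here is a minimal Python sketch of a generic merge join over two lists already sorted by key. It is not Hive's implementation; the function name `merge_join` and the `(key, value)` row layout are assumptions made for this example. Each input is scanned once, with no extra sort and no hash table.

```python
from typing import Iterator, List, Tuple

Row = Tuple[int, str]  # hypothetical row layout: (join_key, value)


def merge_join(left: List[Row], right: List[Row]) -> Iterator[Tuple[int, str, str]]:
    """Inner-join two lists that are ALREADY sorted by join key.

    Illustrative sketch only, not Hive's code.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key is too small: advance left
        elif lk > rk:
            j += 1  # right key is too small: advance right
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing of the two runs.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    yield (lk, lval, rval)
            i, j = i_end, j_end
```

For example, joining `[(1, "a"), (2, "b"), (2, "c"), (4, "d")]` with `[(2, "x"), (3, "y"), (4, "z")]` yields the rows for keys 2 and 4 only. If either input were not sorted, the sketch would silently miss matches, which is why the sort order has to be guaranteed beforehand.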
Thanks @Gunther Hagleitner for your answer.
But how does Hive know that the data in both tables is sorted?
When I join two tables that are both unsorted, the EXPLAIN command tells me that Hive will perform a "Merge Join" as well...
I know that an SMB (Sort Merge Bucket) join will improve a join, but that requires bucketed tables...