I'm getting dizzy reading all the contradictory information about Hive and sorted tables. The Apache documentation seems clear in stating that it is the user's responsiblity to pre-sort data being inserted. However, if that's the case then what purpose does the hive.enforce.sorting setting serve? I've seen at least two Blogs that suggest sorting will happen automatically if you set that true. Anyone?
Hi @Steven Hirsch. It's easy when you just consider 2 commands: ORDER BY and SORT BY. Here is what Hive manual says:
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.
Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.
Basically, the data in each reducer will be sorted according to the order that the user specified.
So, if you need absolute ordering guarantee than use "order by" otherwise, use "sort by". Obviously, guaranteed ordering could have significant performance issues over large datasets.
Hi, Scott. That doesn't really address my question. I understand what ORDER BY and SORT BY do when used in a query. What I need to understand is what the semantics are when unsorted data is inserted into a target table that is (a) defined as bucketed and sorted by some set of columns and (b) the environment has had 'hive.enforce.sorting' set true. I have read postings elsewhere that imply the sort ordering in the buckets will be maintained automatically and dynamically by Hive. Is that not the case? If it's not, then can you explain what the semantics of 'hive.enforce.sorting' are?
Sorry @Steven Hirsch, I misread the question. This may not answer your question directly but this is what I understand about sorting:
1. Sorting is a on a per bucket level. There is no global sort condition so if your query spans buckets than an additional sorting step will occur. Someone recommended a workaround is to keep the same number of buckets as reducers.
2. Providing the "sort by" clause on table definition will only sort the data on read and not during insert. (See #3)
3. The "hive.enforce.sorting" will sort data when inserted into a table defined with a sort clause. This setting is deprecated in Hive 2.x and is now "true" by default (https://issues.apache.org/jira/browse/HIVE-12331)
The question remains whether setting "hive.enforce.sorting" to true and inserting unsorted data into a sorted, bucketed table will sort the data across buckets? This I don't know for sure but could be easily tested.
Hope this helps.
As I understand it, MR/Tez will create additional reducers to do sorting of the data when you perform a SORT BY. If you set this setting to true, the data is sorted when its inserted into the table. When you then perform a SORT BY, less resources are used (fewer reducers) and the query should be faster.
This setting has been removed in Hive 2.x. It essentially defaults to true.
HDP 2.4 includes Hive 1.2.1, so the setting still exists and it defaults to false.