Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

how can you sort larger data set on insert?

how can you sort larger data set on insert?

New Contributor

I had a similar issue to https://community.hortonworks.com/questions/10513/hive-original-order-of-records-using-collect-list..... When I use sort by like Guilherme mentioned on a small set of data,it is ordering the collect_list correctly. When I attempt to run it on the full data set, I've noticed that the order of the collect_list is no longer sorted. I added the set hive.enforce.sorting=true based off of: https://community.hortonworks.com/questions/49931/does-hive-sort-input-or-not.html since we are on version 1.2.1, but it didn't make a difference. Any help would be much appreciated.

i'm doing something like the below:

set hive.enforce.sorting=true; insert into database2.filtered_rn select cust_id, timeOf, keys, cd, score, collect_list(weight), concat_WS(',',collect_list(ui)), comp, ROW_NUMBER() over(partition by cust_id order by timeOf desc) as row_num FROM ( select ld.cust_id, ld.timeOf, ld.keys, lf.cd, ld.score, t.weight, t.ui, COALESCE(ld.compI, 0L) as comp from database1.xxxx ld inner join database2.xxxx lf on ld.lof_id = lf.lof_id left join database1.xxxx r on r.lof_id = ld.lof_id left join datbase1.xxxx t on r.score_id = t.score_id) distribute by (cust_id,timeOf) sort by abs(weight) desc)sub group by cust_id, timeOf, keys, cd, score, comp;

3 REPLIES 3

Re: how can you sort larger data set on insert?

Expert Contributor

Maybe you could format your code for more answers:

set hive.enforce.sorting=true; 
insert into database2.filtered_rn 
 select cust_id, timeOf, keys, cd, score, collect_list(weight), concat_WS(',',collect_list(ui)), comp, ROW_NUMBER() over(partition by cust_id order by timeOf desc) as row_num 
FROM ( 
   select ld.cust_id, ld.timeOf, ld.keys, lf.cd, ld.score, t.weight, t.ui, COALESCE(ld.compI, 0L) as comp 
   from database1.xxxx ld 
      inner join database2.xxxx lf on ld.lof_id = lf.lof_id 
      left join database1.xxxx r on r.lof_id = ld.lof_id 
      left join datbase1.xxxx t on r.score_id = t.score_id)
   distribute by (cust_id,timeOf) 
   sort by abs(weight) desc) 
   sub group by cust_id, timeOf, keys, cd, score, comp;

Re: how can you sort larger data set on insert?

Expert Contributor

I'm just going to be that the "distrubute by" is actually throwing things off. Just a guess. No evidence to back it up.

Re: how can you sort larger data set on insert?

Expert Contributor

@Ryan Redmond

I think my guess was kinda correct... According to the docs:

Difference between Sort By and Order By
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.


Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

So because you are using "Sort by" with "Distribute by" you are going to continue to get a funky order. By reducer your are getting guaranteed order... I have a feeling Even if you change to using order by you still may get weird results if you keep using "Distribubte by" with columns that don't contain the "sort by" column.

Don't have an account?
Coming from Hortonworks? Activate your account here