Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive original order of records using collect_list

Solved Go to solution
Highlighted

Hive original order of records using collect_list

New Contributor

In Hive how to do collect_list(path) in ascending order based on servicetime column grouping by service_date,timeslot,customer_id,category,product,sub_product.

Path order is correct when the select query is executed.

Correct Answer Hive Query -> 

select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product;
Output : 38,|MN_0038|MN_0008|MN_0003|MN_0002|MN_0001|Dummy:|MN_0003|MN_0002:|MN_0001|Dummy:|MN_0001|Dummy

But when I execute on 1 gb of data for bulk aggregation from one table to another table the order of path is not correct.

Wrong Answer Hive Query -> with isort as

select calls,path from raw_test order by servicetime asc)
from isort
insert into table test 
select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product;
Output : 38,|MN_0003|MN_0002:|MN_0001|Dummy:|MN_0038|MN_0008|MN_0003|MN_0002|MN_0001|Dummy:|MN_0001|Dummy

Like Mysql am trying to write hive query with order by clause am facing an error

Error Hive Query like MySQL -> with isort as (

select calls,path from raw_test order by servicetime asc)
from isort
insert into table test 
select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product order by servicetime;
1 ACCEPTED SOLUTION

Accepted Solutions

Re: Hive original order of records using collect_list

@Balachandran Karnati

Collect_list uses ArrayList, so the data will be kept in the same order they were added, to do that, uou need to use SORT BY clause in a subquery, don't use ORDER BY, it will cause your query to execute in a non-distributed way.

Find a simple example below:

drop table collect;
create external table collect(
  key string,
  order_field int,
  value double
   
)
row format delimited fields terminated by ',' 
stored as textfile 
;
[root@sandbox ~]# cat collect1.txt
a,1,1.0
a,3,3.0
a,2,2.0
b,1,1.0
b,2,2.0
b,3,3.0
[root@sandbox ~]# cat collect2.txt
a,1,1.1
a,3,3.1
a,2,2.1
b,1,1.1
b,2,2.1
b,3,3.1
[root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted;
create table collect_sorted as 
select key, collect_list(value)
from 
(select * from collect sort by key, order_field, value desc) x
group by key
;
1 REPLY 1

Re: Hive original order of records using collect_list

@Balachandran Karnati

Collect_list uses ArrayList, so the data will be kept in the same order they were added, to do that, uou need to use SORT BY clause in a subquery, don't use ORDER BY, it will cause your query to execute in a non-distributed way.

Find a simple example below:

drop table collect;
create external table collect(
  key string,
  order_field int,
  value double
   
)
row format delimited fields terminated by ',' 
stored as textfile 
;
[root@sandbox ~]# cat collect1.txt
a,1,1.0
a,3,3.0
a,2,2.0
b,1,1.0
b,2,2.0
b,3,3.0
[root@sandbox ~]# cat collect2.txt
a,1,1.1
a,3,3.1
a,2,2.1
b,1,1.1
b,2,2.1
b,3,3.1
[root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted;
create table collect_sorted as 
select key, collect_list(value)
from 
(select * from collect sort by key, order_field, value desc) x
group by key
;
Don't have an account?
Coming from Hortonworks? Activate your account here