Support Questions

Find answers, ask questions, and share your expertise

Hive original order of records using collect_list

avatar
New Contributor

In Hive how to do collect_list(path) in ascending order based on servicetime column grouping by service_date,timeslot,customer_id,category,product,sub_product.

Path order is correct when the select query is executed.

Correct Answer Hive Query -> 

select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product;
Output : 38,|MN_0038|MN_0008|MN_0003|MN_0002|MN_0001|Dummy:|MN_0003|MN_0002:|MN_0001|Dummy:|MN_0001|Dummy

But when I execute on 1 gb of data for bulk aggregation from one table to another table the order of path is not correct.

Wrong Answer Hive Query -> with isort as

select calls,path from raw_test order by servicetime asc)
from isort
insert into table test 
select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product;
Output : 38,|MN_0003|MN_0002:|MN_0001|Dummy:|MN_0038|MN_0008|MN_0003|MN_0002|MN_0001|Dummy:|MN_0001|Dummy

Like Mysql am trying to write hive query with order by clause am facing an error

Error Hive Query like MySQL -> with isort as (

select calls,path from raw_test order by servicetime asc)
from isort
insert into table test 
select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product order by servicetime;
1 ACCEPTED SOLUTION

avatar

@Balachandran Karnati

Collect_list uses ArrayList, so the data will be kept in the same order they were added, to do that, uou need to use SORT BY clause in a subquery, don't use ORDER BY, it will cause your query to execute in a non-distributed way.

Find a simple example below:

drop table collect;
create external table collect(
  key string,
  order_field int,
  value double
   
)
row format delimited fields terminated by ',' 
stored as textfile 
;
[root@sandbox ~]# cat collect1.txt
a,1,1.0
a,3,3.0
a,2,2.0
b,1,1.0
b,2,2.0
b,3,3.0
[root@sandbox ~]# cat collect2.txt
a,1,1.1
a,3,3.1
a,2,2.1
b,1,1.1
b,2,2.1
b,3,3.1
[root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted;
create table collect_sorted as 
select key, collect_list(value)
from 
(select * from collect sort by key, order_field, value desc) x
group by key
;

View solution in original post

3 REPLIES 3

avatar

@Balachandran Karnati

Collect_list uses ArrayList, so the data will be kept in the same order they were added, to do that, uou need to use SORT BY clause in a subquery, don't use ORDER BY, it will cause your query to execute in a non-distributed way.

Find a simple example below:

drop table collect;
create external table collect(
  key string,
  order_field int,
  value double
   
)
row format delimited fields terminated by ',' 
stored as textfile 
;
[root@sandbox ~]# cat collect1.txt
a,1,1.0
a,3,3.0
a,2,2.0
b,1,1.0
b,2,2.0
b,3,3.0
[root@sandbox ~]# cat collect2.txt
a,1,1.1
a,3,3.1
a,2,2.1
b,1,1.1
b,2,2.1
b,3,3.1
[root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted;
create table collect_sorted as 
select key, collect_list(value)
from 
(select * from collect sort by key, order_field, value desc) x
group by key
;

avatar
New Contributor

For me (working on CDP) this is still now working. 

create temporary table stage1 as SELECT *
FROM (
SELECT customer
, ts
, sum(enodmywindow) over (partition by customer order by ts desc rows between unbounded preceding and current row) mywindow
, sum(1) over (partition by customer order by ts desc rows between unbounded preceding and current row) row_count
FROM od_stage1
WHERE previous_day_ts is null -- to change when the next day ts is fixed
) a
DISTRIBUTE BY customer, mywindow sort by customer, mywindow, row_count
;

create temporary table stage2 as
SELECT customer
, mywindow
, collect_list(ts) ts_list
, collect_list(row_count) row_count_list
FROM (select *
from stage1
DISTRIBUTE BY customer, mywindow sort by customer, mywindow, row_count ) x
group by customer
, mywindow
;

+-------------------+--------------------------+-------------+----------------------------------------------------+
| customer | trip_count | row_count_list |
+-------------------+--------------------------+-------------+----------------------------------------------------+
| someone | 2023-01-23 17:18:30.89 | 8 | [110] |
| someone | 2023-01-24 01:34:19.052 | 7 | [109] |
| someone | 2023-01-24 03:53:09.388 | 6 | [108] |
| someone | 2023-01-24 09:32:03.172 | 5 | [71,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107] |
| someone | 2023-01-24 11:48:21.606 | 4 | [51,50] |
| someone | 2023-01-24 12:29:52.992 | 3 | [48,49] |
| someone | 2023-01-24 17:35:35.846 | 2 | [46,47] |
| someone | 2023-01-24 20:03:30.239 | 1 | [20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45] |
+-------------------+--------------------------+-------------+----------------------------------------------------+

avatar
New Contributor

Sorry, pasted the wrong output:

+---------------+-----------+----------------------------------------------------+
| customer | mywindow | row_count_list |
+---------------+-----------+----------------------------------------------------+
| somecustomer | 1 | [20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45] |
| somecustomer | 2 | [46,47] |
| somecustomer | 3 | [48,49] |
| somecustomer | 4 | [51,50] |
| somecustomer | 5 | [71,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107] |
| somecustomer | 6 | [108] |
| somecustomer | 7 | [109] |
| somecustomer | 8 | [110] |
+---------------+-----------+----------------------------------------------------+