Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Hive original order of records using collect_list

avatar

In Hive how to do collect_list(path) in ascending order based on servicetime column grouping by service_date,timeslot,customer_id,category,product,sub_product.

Path order is correct when the select query is executed.

Correct Answer Hive Query -> 

select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product;
Output : 38,|MN_0038|MN_0008|MN_0003|MN_0002|MN_0001|Dummy:|MN_0003|MN_0002:|MN_0001|Dummy:|MN_0001|Dummy

But when I execute on 1 gb of data for bulk aggregation from one table to another table the order of path is not correct.

Wrong Answer Hive Query -> with isort as

select calls,path from raw_test order by servicetime asc)
from isort
insert into table test 
select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product;
Output : 38,|MN_0003|MN_0002:|MN_0001|Dummy:|MN_0038|MN_0008|MN_0003|MN_0002|MN_0001|Dummy:|MN_0001|Dummy

Like Mysql am trying to write hive query with order by clause am facing an error

Error Hive Query like MySQL -> with isort as (

select calls,path from raw_test order by servicetime asc)
from isort
insert into table test 
select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product order by servicetime;
1 ACCEPTED SOLUTION

avatar

@Balachandran Karnati

Collect_list uses ArrayList, so the data will be kept in the same order they were added, to do that, uou need to use SORT BY clause in a subquery, don't use ORDER BY, it will cause your query to execute in a non-distributed way.

Find a simple example below:

drop table collect;
create external table collect(
  key string,
  order_field int,
  value double
   
)
row format delimited fields terminated by ',' 
stored as textfile 
;
[root@sandbox ~]# cat collect1.txt
a,1,1.0
a,3,3.0
a,2,2.0
b,1,1.0
b,2,2.0
b,3,3.0
[root@sandbox ~]# cat collect2.txt
a,1,1.1
a,3,3.1
a,2,2.1
b,1,1.1
b,2,2.1
b,3,3.1
[root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted;
create table collect_sorted as 
select key, collect_list(value)
from 
(select * from collect sort by key, order_field, value desc) x
group by key
;

View solution in original post

3 REPLIES 3

avatar

@Balachandran Karnati

Collect_list uses ArrayList, so the data will be kept in the same order they were added, to do that, uou need to use SORT BY clause in a subquery, don't use ORDER BY, it will cause your query to execute in a non-distributed way.

Find a simple example below:

drop table collect;
create external table collect(
  key string,
  order_field int,
  value double
   
)
row format delimited fields terminated by ',' 
stored as textfile 
;
[root@sandbox ~]# cat collect1.txt
a,1,1.0
a,3,3.0
a,2,2.0
b,1,1.0
b,2,2.0
b,3,3.0
[root@sandbox ~]# cat collect2.txt
a,1,1.1
a,3,3.1
a,2,2.1
b,1,1.1
b,2,2.1
b,3,3.1
[root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted;
create table collect_sorted as 
select key, collect_list(value)
from 
(select * from collect sort by key, order_field, value desc) x
group by key
;

avatar
New Member

For me (working on CDP) this is still now working. 

create temporary table stage1 as SELECT *
FROM (
SELECT customer
, ts
, sum(enodmywindow) over (partition by customer order by ts desc rows between unbounded preceding and current row) mywindow
, sum(1) over (partition by customer order by ts desc rows between unbounded preceding and current row) row_count
FROM od_stage1
WHERE previous_day_ts is null -- to change when the next day ts is fixed
) a
DISTRIBUTE BY customer, mywindow sort by customer, mywindow, row_count
;

create temporary table stage2 as
SELECT customer
, mywindow
, collect_list(ts) ts_list
, collect_list(row_count) row_count_list
FROM (select *
from stage1
DISTRIBUTE BY customer, mywindow sort by customer, mywindow, row_count ) x
group by customer
, mywindow
;

+-------------------+--------------------------+-------------+----------------------------------------------------+
| customer | trip_count | row_count_list |
+-------------------+--------------------------+-------------+----------------------------------------------------+
| someone | 2023-01-23 17:18:30.89 | 8 | [110] |
| someone | 2023-01-24 01:34:19.052 | 7 | [109] |
| someone | 2023-01-24 03:53:09.388 | 6 | [108] |
| someone | 2023-01-24 09:32:03.172 | 5 | [71,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107] |
| someone | 2023-01-24 11:48:21.606 | 4 | [51,50] |
| someone | 2023-01-24 12:29:52.992 | 3 | [48,49] |
| someone | 2023-01-24 17:35:35.846 | 2 | [46,47] |
| someone | 2023-01-24 20:03:30.239 | 1 | [20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45] |
+-------------------+--------------------------+-------------+----------------------------------------------------+

avatar
New Member

Sorry, pasted the wrong output:

+---------------+-----------+----------------------------------------------------+
| customer | mywindow | row_count_list |
+---------------+-----------+----------------------------------------------------+
| somecustomer | 1 | [20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45] |
| somecustomer | 2 | [46,47] |
| somecustomer | 3 | [48,49] |
| somecustomer | 4 | [51,50] |
| somecustomer | 5 | [71,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107] |
| somecustomer | 6 | [108] |
| somecustomer | 7 | [109] |
| somecustomer | 8 | [110] |
+---------------+-----------+----------------------------------------------------+