Created 01-19-2016 08:55 AM
In Hive how to do collect_list(path) in ascending order based on servicetime column grouping by service_date,timeslot,customer_id,category,product,sub_product.
Path order is correct when the select query is executed.
Correct Answer Hive Query -> select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product; Output : 38,|MN_0038|MN_0008|MN_0003|MN_0002|MN_0001|Dummy:|MN_0003|MN_0002:|MN_0001|Dummy:|MN_0001|Dummy
But when I execute on 1 gb of data for bulk aggregation from one table to another table the order of path is not correct.
Wrong Answer Hive Query -> with isort as
select calls,path from raw_test order by servicetime asc) from isort insert into table test select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product; Output : 38,|MN_0003|MN_0002:|MN_0001|Dummy:|MN_0038|MN_0008|MN_0003|MN_0002|MN_0001|Dummy:|MN_0001|Dummy
Like Mysql am trying to write hive query with order by clause am facing an error
Error Hive Query like MySQL -> with isort as (
select calls,path from raw_test order by servicetime asc) from isort insert into table test select sum(calls),concat_ws(':',collect_list(path)) FROM test group by service_date,timeslot,customer_id,category,product,sub_product order by servicetime;
Created 01-20-2016 12:07 AM
Collect_list uses ArrayList, so the data will be kept in the same order they were added, to do that, uou need to use SORT BY clause in a subquery, don't use ORDER BY, it will cause your query to execute in a non-distributed way.
Find a simple example below:
drop table collect; create external table collect( key string, order_field int, value double ) row format delimited fields terminated by ',' stored as textfile ;
[root@sandbox ~]# cat collect1.txt a,1,1.0 a,3,3.0 a,2,2.0 b,1,1.0 b,2,2.0 b,3,3.0 [root@sandbox ~]# cat collect2.txt a,1,1.1 a,3,3.1 a,2,2.1 b,1,1.1 b,2,2.1 b,3,3.1 [root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted; create table collect_sorted as select key, collect_list(value) from (select * from collect sort by key, order_field, value desc) x group by key ;
Created 01-20-2016 12:07 AM
Collect_list uses ArrayList, so the data will be kept in the same order they were added, to do that, uou need to use SORT BY clause in a subquery, don't use ORDER BY, it will cause your query to execute in a non-distributed way.
Find a simple example below:
drop table collect; create external table collect( key string, order_field int, value double ) row format delimited fields terminated by ',' stored as textfile ;
[root@sandbox ~]# cat collect1.txt a,1,1.0 a,3,3.0 a,2,2.0 b,1,1.0 b,2,2.0 b,3,3.0 [root@sandbox ~]# cat collect2.txt a,1,1.1 a,3,3.1 a,2,2.1 b,1,1.1 b,2,2.1 b,3,3.1 [root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted; create table collect_sorted as select key, collect_list(value) from (select * from collect sort by key, order_field, value desc) x group by key ;