Support Questions

Find answers, ask questions, and share your expertise

Hive on Tez: How to order an array column?

avatar

Hi! Here is a small excerpt of my table:

select * from source_table;

point		id					station		journey
1510764333	"000003825fc3200b8d541e8ee3844522"	"St. George"	1	
1510764342	"000003825fc3200b8d541e8ee3844522"	"St. George"	1
1510764420	"000003825fc3200b8d541e8ee3844522"	"Museum"	1	
1510764448	"000003825fc3200b8d541e8ee3844522"	"Museum"	1	
1510764509	"000003825fc3200b8d541e8ee3844522"	"Queens Park"	1	
1510764779	"000003825fc3200b8d541e8ee3844522"	"St. Andrew"	1	
1510764781	"000003825fc3200b8d541e8ee3844522"	"St. Andrew"	1		
1510704290	"0000650b988dabe0cc4d5923e9572548"	"Museum"	2	
1510704508	"0000650b988dabe0cc4d5923e9572548"	"St. George"	2	<br>

I would like all the stations per journey grouped in an array ordered from first to last. The result should look like this:

point		id					station		journey	array
1510764333	"000003825fc3200b8d541e8ee3844522"	"St. George"	1	["St. George","Museum","Queens Park","St. Andrew"]	
1510764342	"000003825fc3200b8d541e8ee3844522"	"St. George"	1	["St. George","Museum","Queens Park","St. Andrew"]
1510764420	"000003825fc3200b8d541e8ee3844522"	"Museum"	1	["St. George","Museum","Queens Park","St. Andrew"]
1510764448	"000003825fc3200b8d541e8ee3844522"	"Museum"	1	["St. George","Museum","Queens Park","St. Andrew"]
1510764509	"000003825fc3200b8d541e8ee3844522"	"Queens Park"	1	["St. George","Museum","Queens Park","St. Andrew"]	
1510764779	"000003825fc3200b8d541e8ee3844522"	"St. Andrew"	1	["St. George","Museum","Queens Park","St. Andrew"]			
1510704290	"0000650b988dabe0cc4d5923e9572548"	"Museum"	2	["Museum","St.George"]
1510704508	"0000650b988dabe0cc4d5923e9572548"	"St. George"	2	["Museum","St.George"]

Unfortunately I cannot get them into the right order. This code gives me the array but in the incorrect order (see below):

select point, ID, station, journey, collect_set(station) over (partition by journey) station_arr from source_table order by journey, point;

point		id					station		journey	array
1510764333	"000003825fc3200b8d541e8ee3844522"	"St. George"	1	["Museum","St. George","St. Andrew","Queens Park"]	
1510764342	"000003825fc3200b8d541e8ee3844522"	"St. George"	1	["Museum","St. George","St. Andrew","Queens Park"]
1510764420	"000003825fc3200b8d541e8ee3844522"	"Museum"	1	["Museum","St. George","St. Andrew","Queens Park"]
1510764448	"000003825fc3200b8d541e8ee3844522"	"Museum"	1	["Museum","St. George","St. Andrew","Queens Park"]
1510764509	"000003825fc3200b8d541e8ee3844522"	"Queens Park"	1	["Museum","St. George","St. Andrew","Queens Park"]	
1510764779	"000003825fc3200b8d541e8ee3844522"	"St. Andrew"	1	["Museum","St. George","St. Andrew","Queens Park"]		
1510704290	"0000650b988dabe0cc4d5923e9572548"	"Museum"	2	["Museum","St.George"]
1510704508	"0000650b988dabe0cc4d5923e9572548"	"St. George"	2	["Museum","St.George"]

Is there a way to create this array and keep the order consistent to the source table? I don't remember having these issues on my sandbox VM and wonder if it has something to do with multiple reducers being involved..?

Thanks

1 ACCEPTED SOLUTION

avatar

Figured this out- my above approach worked whereby I can order the list and then just grab the last line in the list window. Like this:

select *, last_value(station_arr) over (partition by journey order by point RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as route from

(select point, id, station, journey, collect_set(station) over (partition by journey order by point) from source_table);


This will take the ordered list and will populate that list to every row in the partition as requested above.

View solution in original post

3 REPLIES 3

avatar
Master Guru

avatar

Hi Sunile, thanks for the post. I've been experimenting with ideas in the link you sent and still cannot get these silly arrays to be ordered correctly. My latest thought is around something like:

select point, id, station, journey, collect_set(station) over (partition by journey order by point) as array from source_table;

Which will give me an output like this:

point		id					station		journey	array
1510764333	"000003825fc3200b8d541e8ee3844522"	"St. George"	1	["St. George"]	
1510764342	"000003825fc3200b8d541e8ee3844522"	"St. George"	1	["St. George"]
1510764420	"000003825fc3200b8d541e8ee3844522"	"Museum"	1	["St. George","Museum"]
1510764448	"000003825fc3200b8d541e8ee3844522"	"Museum"	1	["St. George","Museum"]
1510764509	"000003825fc3200b8d541e8ee3844522"	"Queens Park"	1	["St. George","Museum","Queens Park"]	
1510764779	"000003825fc3200b8d541e8ee3844522"	"St. Andrew"	1	["St. George","Museum","Queens Park","St. Andrew"]			
1510704290	"0000650b988dabe0cc4d5923e9572548"	"Museum"	2	["Museum"]
1510704508	"0000650b988dabe0cc4d5923e9572548"	"St. George"	2	["Museum","St.George"]

Since the order of the last array per journey IS correct I'm trying to get the last_value(array) over (partition by journey) with the hopes of just getting the last value and adding it to each row. Although I can't seem to get it to work yet...
Kind of a hacky solution though- surely there is a simpler way of doing this that I'm missing...

select *, last_value(station_arr) over (partition by id,journey) from

(select point, id, station, journey, collect_set(station) over (partition by journey order by point) from source_table);

avatar

Figured this out- my above approach worked whereby I can order the list and then just grab the last line in the list window. Like this:

select *, last_value(station_arr) over (partition by journey order by point RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as route from

(select point, id, station, journey, collect_set(station) over (partition by journey order by point) from source_table);


This will take the ordered list and will populate that list to every row in the partition as requested above.