Created 06-10-2018 03:07 AM
Hi! Here is a small excerpt of my table:
select * from source_table;
point id station journey 1510764333 "000003825fc3200b8d541e8ee3844522" "St. George" 1 1510764342 "000003825fc3200b8d541e8ee3844522" "St. George" 1 1510764420 "000003825fc3200b8d541e8ee3844522" "Museum" 1 1510764448 "000003825fc3200b8d541e8ee3844522" "Museum" 1 1510764509 "000003825fc3200b8d541e8ee3844522" "Queens Park" 1 1510764779 "000003825fc3200b8d541e8ee3844522" "St. Andrew" 1 1510764781 "000003825fc3200b8d541e8ee3844522" "St. Andrew" 1 1510704290 "0000650b988dabe0cc4d5923e9572548" "Museum" 2 1510704508 "0000650b988dabe0cc4d5923e9572548" "St. George" 2 <br>
I would like all the stations per journey grouped in an array ordered from first to last. The result should look like this:
point id station journey array 1510764333 "000003825fc3200b8d541e8ee3844522" "St. George" 1 ["St. George","Museum","Queens Park","St. Andrew"] 1510764342 "000003825fc3200b8d541e8ee3844522" "St. George" 1 ["St. George","Museum","Queens Park","St. Andrew"] 1510764420 "000003825fc3200b8d541e8ee3844522" "Museum" 1 ["St. George","Museum","Queens Park","St. Andrew"] 1510764448 "000003825fc3200b8d541e8ee3844522" "Museum" 1 ["St. George","Museum","Queens Park","St. Andrew"] 1510764509 "000003825fc3200b8d541e8ee3844522" "Queens Park" 1 ["St. George","Museum","Queens Park","St. Andrew"] 1510764779 "000003825fc3200b8d541e8ee3844522" "St. Andrew" 1 ["St. George","Museum","Queens Park","St. Andrew"] 1510704290 "0000650b988dabe0cc4d5923e9572548" "Museum" 2 ["Museum","St.George"] 1510704508 "0000650b988dabe0cc4d5923e9572548" "St. George" 2 ["Museum","St.George"]
Unfortunately I cannot get them into the right order. This code gives me the array but in the incorrect order (see below):
select point, ID, station, journey, collect_set(station) over (partition by journey) station_arr from source_table order by journey, point;
point id station journey array 1510764333 "000003825fc3200b8d541e8ee3844522" "St. George" 1 ["Museum","St. George","St. Andrew","Queens Park"] 1510764342 "000003825fc3200b8d541e8ee3844522" "St. George" 1 ["Museum","St. George","St. Andrew","Queens Park"] 1510764420 "000003825fc3200b8d541e8ee3844522" "Museum" 1 ["Museum","St. George","St. Andrew","Queens Park"] 1510764448 "000003825fc3200b8d541e8ee3844522" "Museum" 1 ["Museum","St. George","St. Andrew","Queens Park"] 1510764509 "000003825fc3200b8d541e8ee3844522" "Queens Park" 1 ["Museum","St. George","St. Andrew","Queens Park"] 1510764779 "000003825fc3200b8d541e8ee3844522" "St. Andrew" 1 ["Museum","St. George","St. Andrew","Queens Park"] 1510704290 "0000650b988dabe0cc4d5923e9572548" "Museum" 2 ["Museum","St.George"] 1510704508 "0000650b988dabe0cc4d5923e9572548" "St. George" 2 ["Museum","St.George"]
Is there a way to create this array and keep the order consistent to the source table? I don't remember having these issues on my sandbox VM and wonder if it has something to do with multiple reducers being involved..?
Thanks
Created 06-11-2018 11:56 PM
Figured this out- my above approach worked whereby I can order the list and then just grab the last line in the list window. Like this:
select *, last_value(station_arr) over (partition by journey order by point RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as route from
(select point, id, station, journey, collect_set(station) over (partition by journey order by point) from source_table);
This will take the ordered list and will populate that list to every row in the partition as requested above.
Created 06-10-2018 03:13 AM
I found a similar question here:
https://stackoverflow.com/questions/24805226/keeping-the-order-of-records-in-hive-collect
Hope it helps
Created 06-10-2018 03:37 PM
Hi Sunile, thanks for the post. I've been experimenting with ideas in the link you sent and still cannot get these silly arrays to be ordered correctly. My latest thought is around something like:
select point, id, station, journey, collect_set(station) over (partition by journey order by point) as array from source_table;
Which will give me an output like this:
point id station journey array 1510764333 "000003825fc3200b8d541e8ee3844522" "St. George" 1 ["St. George"] 1510764342 "000003825fc3200b8d541e8ee3844522" "St. George" 1 ["St. George"] 1510764420 "000003825fc3200b8d541e8ee3844522" "Museum" 1 ["St. George","Museum"] 1510764448 "000003825fc3200b8d541e8ee3844522" "Museum" 1 ["St. George","Museum"] 1510764509 "000003825fc3200b8d541e8ee3844522" "Queens Park" 1 ["St. George","Museum","Queens Park"] 1510764779 "000003825fc3200b8d541e8ee3844522" "St. Andrew" 1 ["St. George","Museum","Queens Park","St. Andrew"] 1510704290 "0000650b988dabe0cc4d5923e9572548" "Museum" 2 ["Museum"] 1510704508 "0000650b988dabe0cc4d5923e9572548" "St. George" 2 ["Museum","St.George"]
Since the order of the last array per journey IS correct I'm trying to get the last_value(array) over (partition by journey) with the hopes of just getting the last value and adding it to each row. Although I can't seem to get it to work yet...
Kind of a hacky solution though- surely there is a simpler way of doing this that I'm missing...
select *, last_value(station_arr) over (partition by id,journey) from
(select point, id, station, journey, collect_set(station) over (partition by journey order by point) from source_table);
Created 06-11-2018 11:56 PM
Figured this out- my above approach worked whereby I can order the list and then just grab the last line in the list window. Like this:
select *, last_value(station_arr) over (partition by journey order by point RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as route from
(select point, id, station, journey, collect_set(station) over (partition by journey order by point) from source_table);
This will take the ordered list and will populate that list to every row in the partition as requested above.