Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Get UNIQUE values of each array in column

avatar

Hey guys,

I have a column in my hive db that contains arrays of strings, let's say column1. I now want to create a new column based on this one, but now with arrays that have only the unique strings of the arrays in column1.

How do I best go about this in HiveQL? I've looked around but haven't found a solution I can implement.

Tx!!

Willem

1 ACCEPTED SOLUTION

avatar
Master Guru

That was interesting I learned a bit about Hive today. Yes its possible:

0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id  |        test9.name        |
+-----------+--------------------------+--+
| 1         | ["ben","klaus","klaus"]  |
| 2         | ["ben","klaus","klaus"]  |
+-----------+--------------------------+--+

You can explode the arrays in a lateral view, and then using collect_list to merge them again using a distinct:


0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname)) 
from (select id,flatname  from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id  |       _c1        |
+-----+------------------+--+
| 1   | ["ben","klaus"]  |
| 2   | ["ben","klaus"]  |
+-----+------------------+--+

You can also use a UDF might be faster.

Brickhouse provides a set of really cool UDFs which I have used before. CombineUnique sounds like what you want.

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUD...

View solution in original post

1 REPLY 1

avatar
Master Guru

That was interesting I learned a bit about Hive today. Yes its possible:

0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id  |        test9.name        |
+-----------+--------------------------+--+
| 1         | ["ben","klaus","klaus"]  |
| 2         | ["ben","klaus","klaus"]  |
+-----------+--------------------------+--+

You can explode the arrays in a lateral view, and then using collect_list to merge them again using a distinct:


0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname)) 
from (select id,flatname  from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id  |       _c1        |
+-----+------------------+--+
| 1   | ["ben","klaus"]  |
| 2   | ["ben","klaus"]  |
+-----+------------------+--+

You can also use a UDF might be faster.

Brickhouse provides a set of really cool UDFs which I have used before. CombineUnique sounds like what you want.

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUD...