Support Questions

Find answers, ask questions, and share your expertise

Get UNIQUE values of each array in column

avatar
New Contributor

Hey guys,

I have a column in my hive db that contains arrays of strings, let's say column1. I now want to create a new column based on this one, but now with arrays that have only the unique strings of the arrays in column1.

How do I best go about this in HiveQL? I've looked around but haven't found a solution I can implement.

Tx!!

Willem

1 ACCEPTED SOLUTION

avatar
Master Guru

That was interesting I learned a bit about Hive today. Yes its possible:

0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id  |        test9.name        |
+-----------+--------------------------+--+
| 1         | ["ben","klaus","klaus"]  |
| 2         | ["ben","klaus","klaus"]  |
+-----------+--------------------------+--+

You can explode the arrays in a lateral view, and then using collect_list to merge them again using a distinct:


0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname)) 
from (select id,flatname  from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id  |       _c1        |
+-----+------------------+--+
| 1   | ["ben","klaus"]  |
| 2   | ["ben","klaus"]  |
+-----+------------------+--+

You can also use a UDF might be faster.

Brickhouse provides a set of really cool UDFs which I have used before. CombineUnique sounds like what you want.

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUD...

View solution in original post

1 REPLY 1

avatar
Master Guru

That was interesting I learned a bit about Hive today. Yes its possible:

0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id  |        test9.name        |
+-----------+--------------------------+--+
| 1         | ["ben","klaus","klaus"]  |
| 2         | ["ben","klaus","klaus"]  |
+-----------+--------------------------+--+

You can explode the arrays in a lateral view, and then using collect_list to merge them again using a distinct:


0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname)) 
from (select id,flatname  from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id  |       _c1        |
+-----+------------------+--+
| 1   | ["ben","klaus"]  |
| 2   | ["ben","klaus"]  |
+-----+------------------+--+

You can also use a UDF might be faster.

Brickhouse provides a set of really cool UDFs which I have used before. CombineUnique sounds like what you want.

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUD...