Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Get UNIQUE values of each array in column

avatar
New Contributor

Hey guys,

I have a column in my hive db that contains arrays of strings, let's say column1. I now want to create a new column based on this one, but now with arrays that have only the unique strings of the arrays in column1.

How do I best go about this in HiveQL? I've looked around but haven't found a solution I can implement.

Tx!!

Willem

1 ACCEPTED SOLUTION

avatar
Master Guru

That was interesting I learned a bit about Hive today. Yes its possible:

0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id  |        test9.name        |
+-----------+--------------------------+--+
| 1         | ["ben","klaus","klaus"]  |
| 2         | ["ben","klaus","klaus"]  |
+-----------+--------------------------+--+

You can explode the arrays in a lateral view, and then using collect_list to merge them again using a distinct:


0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname)) 
from (select id,flatname  from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id  |       _c1        |
+-----+------------------+--+
| 1   | ["ben","klaus"]  |
| 2   | ["ben","klaus"]  |
+-----+------------------+--+

You can also use a UDF might be faster.

Brickhouse provides a set of really cool UDFs which I have used before. CombineUnique sounds like what you want.

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUD...

View solution in original post

1 REPLY 1

avatar
Master Guru

That was interesting I learned a bit about Hive today. Yes its possible:

0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id  |        test9.name        |
+-----------+--------------------------+--+
| 1         | ["ben","klaus","klaus"]  |
| 2         | ["ben","klaus","klaus"]  |
+-----------+--------------------------+--+

You can explode the arrays in a lateral view, and then using collect_list to merge them again using a distinct:


0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname)) 
from (select id,flatname  from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id  |       _c1        |
+-----+------------------+--+
| 1   | ["ben","klaus"]  |
| 2   | ["ben","klaus"]  |
+-----+------------------+--+

You can also use a UDF might be faster.

Brickhouse provides a set of really cool UDFs which I have used before. CombineUnique sounds like what you want.

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUD...