Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

Get UNIQUE values of each array in column

New Contributor

Hey guys,

I have a column in my hive db that contains arrays of strings, let's say column1. I now want to create a new column based on this one, but now with arrays that have only the unique strings of the arrays in column1.

How do I best go about this in HiveQL? I've looked around but haven't found a solution I can implement.

Tx!!

Willem

1 ACCEPTED SOLUTION

That was interesting I learned a bit about Hive today. Yes its possible:

0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id  |        test9.name        |
+-----------+--------------------------+--+
| 1         | ["ben","klaus","klaus"]  |
| 2         | ["ben","klaus","klaus"]  |
+-----------+--------------------------+--+

You can explode the arrays in a lateral view, and then using collect_list to merge them again using a distinct:


0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname)) 
from (select id,flatname  from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id  |       _c1        |
+-----+------------------+--+
| 1   | ["ben","klaus"]  |
| 2   | ["ben","klaus"]  |
+-----+------------------+--+

You can also use a UDF might be faster.

Brickhouse provides a set of really cool UDFs which I have used before. CombineUnique sounds like what you want.

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUD...

View solution in original post

1 REPLY 1

That was interesting I learned a bit about Hive today. Yes its possible:

0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id  |        test9.name        |
+-----------+--------------------------+--+
| 1         | ["ben","klaus","klaus"]  |
| 2         | ["ben","klaus","klaus"]  |
+-----------+--------------------------+--+

You can explode the arrays in a lateral view, and then using collect_list to merge them again using a distinct:


0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname)) 
from (select id,flatname  from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id  |       _c1        |
+-----+------------------+--+
| 1   | ["ben","klaus"]  |
| 2   | ["ben","klaus"]  |
+-----+------------------+--+

You can also use a UDF might be faster.

Brickhouse provides a set of really cool UDFs which I have used before. CombineUnique sounds like what you want.

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUD...

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.