Need help creating a custom SerDe.

avatar

I have a file with the following pattern: K,K,K,K,K,K,KV,KV,KV,KV.....

The initial standalone key (K) values are static and never change. The key-value (KV) pairs that follow are dynamic (pairs can be added or removed at any time) and need to be exposed as a map in Hive. The leading K values would be listed as columns.

Does someone have code for a custom SerDe that I can include in the Hive table definition for a file with this structure? Currently we use a custom Python UDF, but we would like to store the files directly in HDFS and only apply the schema at runtime.

1 ACCEPTED SOLUTION

avatar

@Scott Shaw, @Sourygna Luangsay

I created a "minimum-viable-serde" implementing what you described. See if it is what you need.

PS: I'm assuming your last column will be a map<string,string>; I haven't implemented data type handling for the last column yet. For the key columns, it will respect the data types you declare when creating the table.

from shell:

wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependenc... -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar

echo "a,b,c,adsfa,adfa" > /tmp/testserde.txt
echo "1,2,3,asdfasdf,sdfasd" >> /tmp/testserde.txt
echo "4,5,6,adfas,adf,d" >> /tmp/testserde.txt
hadoop fs -mkdir /tmp/testserde/
hadoop fs -put -f /tmp/testserde.txt /tmp/testserde/
hive

from hive:

add jar /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar;
drop table testserde;
create external table testserde (
 field1 string,
 field2 int,
 field3 double,
 maps map<string,string>
)
ROW FORMAT SERDE 'com.github.gbraccialli.hive.serde.NKeys_MapKeyValue'
WITH SERDEPROPERTIES (
 "delimiter" = ","
)
LOCATION '/tmp/testserde/';

select * from testserde;

Source code is here:

https://github.com/gbraccialli/HiveUtils

https://github.com/gbraccialli/HiveUtils/blob/master/src/main/java/com/github/gbraccialli/hive/serde...

PS2: there are still lots of TODOs.


6 REPLIES 6

avatar

Sorry, the actual format is all comma-separated. So after a fixed number of keys (let's assume eight), the pattern switches to a dynamic number of key/value pairs: k,k,k,k,k,k,k,k,k,v,k,v,k,v,k,v.....
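For reference, the row-splitting logic described above can be sketched in plain Python. `parse_line` is a hypothetical helper to illustrate the structure, not part of any SerDe:

```python
def parse_line(line, num_keys=8, delimiter=","):
    """Split one line into fixed key columns plus a map of trailing k,v pairs.

    Assumes the first num_keys tokens are the static columns and the
    remainder alternates key,value,key,value...
    """
    tokens = line.rstrip("\n").split(delimiter)
    columns = tokens[:num_keys]
    rest = tokens[num_keys:]
    # Pair up the remaining tokens; an unpaired trailing key maps to None.
    pairs = {rest[i]: rest[i + 1] if i + 1 < len(rest) else None
             for i in range(0, len(rest), 2)}
    return columns, pairs

cols, kv = parse_line("a,b,c,d,e,f,g,h,x,1,y,2", num_keys=8)
# cols == ["a", "b", "c", "d", "e", "f", "g", "h"], kv == {"x": "1", "y": "2"}
```

A SerDe would do the same split per row and hand the first tokens to the declared columns and the rest to a map<string,string>.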

avatar
Super Collaborator

Instead of spending time writing a new SerDe, wouldn't it be possible to use the following approach:

1) Use a Regex SerDe (https://hive.apache.org/javadocs/r1.2.1/api/org/apache/hadoop/hive/serde2/RegexSerDe.html) to load the 8 "key" columns and the last dynamic column (as a single string) into a first temporary table.

2) With a CTAS, insert the data into an ORC table, using the str_to_map() UDF to transform the dynamic string column into a map. This step would also give you your data in a more performant storage format.
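For anyone unfamiliar with str_to_map(), here is a rough Python approximation of its semantics (its default delimiters are ',' between pairs and ':' between key and value); Hive's exact NULL/edge-case handling may differ:

```python
def str_to_map(text, pair_delim=",", kv_delim=":"):
    """Rough Python approximation of Hive's str_to_map() UDF.

    Splits text on pair_delim into entries, then each entry on kv_delim
    into key and value. Entries without kv_delim get None as the value.
    """
    result = {}
    for entry in text.split(pair_delim):
        key, sep, value = entry.partition(kv_delim)
        result[key] = value if sep else None
    return result

str_to_map("a:1,b:2,c:3")  # {"a": "1", "b": "2", "c": "3"}
```

Note that the alternating k,v,k,v format from the original question would first need the pairs regrouped (e.g. by the regex in step 1) so that each key and its value share one entry.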


avatar

Thanks!! This looks great. We'll give it a try.

avatar

It also works with spark-sql.

avatar
Explorer

@Guilherme Braccialli

I have a log file in which the last field is a set of key=value pairs.

for example:

2017-11-29 16:19:39,217 DEBUG [pool-4-thread-4] OutBound Msg From Engine -> |9=76|35=p|a=b|c=hg|

2017-11-29 16:20:29,217 DEBUG [pool-4-thread-4] OutBound Msg From Engine -> |3=6|35=w|a=b|

How can I analyse this? Can we use your custom SerDe for this?

I'm asking because the Regex SerDe does not support complex data types.