Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Need help creating a custom SerDe.

Solved Go to solution

Need help creating a custom SerDe.

I have a file with the following pattern: K,K,K,K,K,K,KV,KV,KV,KV.....

The initial standalone Keys (K) values are static and never change. The KV (key value pairs) after the keys are dynamic (additional KV pairs can be added or removed at anytime) and need to be listed as a map in Hive. The first K values would be listed as columns.

Does someone have any code for a custom SerDe I can include in the Hive table definition for a file with this structure? Currently we are using a custom UDF with Python but would like to store the files directly in HDFS and only apply the schema at runtime.

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Need help creating a custom SerDe.

@Scott Shaw, @Sourygna Luangsay

I created a "minimum-viable-serde" implementing what you described. See if it is what you need.

PS: I'm assuming your last column will be a map<string,string>, I haven't done data type handling for last column yet. For the key columns, it will respect the data type you declare when creating table.

from shell:

wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependenc... -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar

echo "a,b,c,adsfa,adfa" > /tmp/testserde.txt
echo "1,2,3,asdfasdf,sdfasd" >> /tmp/testserde.txt
echo "4,5,6,adfas,adf,d" >> /tmp/testserde.txt
hadoop fs -mkdir /tmp/testserde/
hadoop fs -put -f /tmp/testserde.txt /tmp/testserde/
hive

from hive:

add jar /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar;
drop table testserde;
create external table testserde (
 field1 string,
 field2 int,
 field3 double,
 maps map<string,string>
)
ROW FORMAT SERDE 'com.github.gbraccialli.hive.serde.NKeys_MapKeyValue'
WITH SERDEPROPERTIES (
 "delimiter" = ","
)
LOCATION '/tmp/testserde/';

select * from testserde;

Source code is here:

https://github.com/gbraccialli/HiveUtils

https://github.com/gbraccialli/HiveUtils/blob/master/src/main/java/com/github/gbraccialli/hive/serde...

PS2: there are lots of TODO yet.

6 REPLIES 6

Re: Need help creating a custom SerDe.

Sorry, the actual format is all comma separated. So after a fixed number of keys (let's assume eight) the pattern then switches to a dynamic number of key\value pairs: k,k,k,k,k,k,k,k,k,v,k,v,k,v,k,v.....

Re: Need help creating a custom SerDe.

Expert Contributor

Instead of spending time writing a new SerDe, wouldn't it be possible to use the following approach:

1) Use a Regex SerDe (https://hive.apache.org/javadocs/r1.2.1/api/org/apache/hadoop/hive/serde2/RegexSerDe.html ) to get in a first temporary table the 8 "keys" columns and the last (String) dynamic column

2) With a CTAS, insert the data into an ORC table, using the str_to_map() UDF to transform the string dynamic column into a map. This step would also enable you to have your data in a more performant backend.

Re: Need help creating a custom SerDe.

@Scott Shaw, @Sourygna Luangsay

I created a "minimum-viable-serde" implementing what you described. See if it is what you need.

PS: I'm assuming your last column will be a map<string,string>, I haven't done data type handling for last column yet. For the key columns, it will respect the data type you declare when creating table.

from shell:

wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependenc... -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar

echo "a,b,c,adsfa,adfa" > /tmp/testserde.txt
echo "1,2,3,asdfasdf,sdfasd" >> /tmp/testserde.txt
echo "4,5,6,adfas,adf,d" >> /tmp/testserde.txt
hadoop fs -mkdir /tmp/testserde/
hadoop fs -put -f /tmp/testserde.txt /tmp/testserde/
hive

from hive:

add jar /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar;
drop table testserde;
create external table testserde (
 field1 string,
 field2 int,
 field3 double,
 maps map<string,string>
)
ROW FORMAT SERDE 'com.github.gbraccialli.hive.serde.NKeys_MapKeyValue'
WITH SERDEPROPERTIES (
 "delimiter" = ","
)
LOCATION '/tmp/testserde/';

select * from testserde;

Source code is here:

https://github.com/gbraccialli/HiveUtils

https://github.com/gbraccialli/HiveUtils/blob/master/src/main/java/com/github/gbraccialli/hive/serde...

PS2: there are lots of TODO yet.

Re: Need help creating a custom SerDe.

Thanks!! This looks great. We'll give it a try.

Re: Need help creating a custom SerDe.

It also works with spark-sql.

Re: Need help creating a custom SerDe.

New Contributor

@Guilherme Braccialli

I have a log file in which i have last field as key value pair.

for example:

2017-11-29 16:19:39,217 DEBUG [pool-4-thread-4] OutBound Msg From Engine -> |9=76|35=p|a=b|c=hg|

2017-11-29 16:20:29,217 DEBUG [pool-4-thread-4] OutBound Msg From Engine -> |3=6|35=w|a=b|

how to analyse this? Can we use your custom serde for this?

Because Regex serde is not supporting complex data types.