Created 11-11-2015 02:59 PM
I have a file with the following pattern: K,K,K,K,K,K,KV,KV,KV,KV.....
The initial standalone Keys (K) values are static and never change. The KV (key value pairs) after the keys are dynamic (additional KV pairs can be added or removed at anytime) and need to be listed as a map in Hive. The first K values would be listed as columns.
Does someone have any code for a custom SerDe I can include in the Hive table definition for a file with this structure? Currently we are using a custom UDF with Python but would like to store the files directly in HDFS and only apply the schema at runtime.
Created 11-13-2015 03:06 AM
@Scott Shaw, @Sourygna Luangsay
I created a "minimum-viable-serde" implementing what you described. See if it is what you need.
PS: I'm assuming your last column will be a map<string,string>, I haven't done data type handling for last column yet. For the key columns, it will respect the data type you declare when creating table.
from shell:
wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependenc... -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar echo "a,b,c,adsfa,adfa" > /tmp/testserde.txt echo "1,2,3,asdfasdf,sdfasd" >> /tmp/testserde.txt echo "4,5,6,adfas,adf,d" >> /tmp/testserde.txt hadoop fs -mkdir /tmp/testserde/ hadoop fs -put -f /tmp/testserde.txt /tmp/testserde/ hive
from hive:
add jar /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar; drop table testserde; create external table testserde ( field1 string, field2 int, field3 double, maps map<string,string> ) ROW FORMAT SERDE 'com.github.gbraccialli.hive.serde.NKeys_MapKeyValue' WITH SERDEPROPERTIES ( "delimiter" = "," ) LOCATION '/tmp/testserde/'; select * from testserde;
Source code is here:
https://github.com/gbraccialli/HiveUtils
PS2: there are lots of TODO yet.
Created 11-11-2015 03:22 PM
Sorry, the actual format is all comma separated. So after a fixed number of keys (let's assume eight) the pattern then switches to a dynamic number of key\value pairs: k,k,k,k,k,k,k,k,k,v,k,v,k,v,k,v.....
Created 11-11-2015 03:46 PM
Instead of spending time writing a new SerDe, wouldn't it be possible to use the following approach:
1) Use a Regex SerDe (https://hive.apache.org/javadocs/r1.2.1/api/org/apache/hadoop/hive/serde2/RegexSerDe.html ) to get in a first temporary table the 8 "keys" columns and the last (String) dynamic column
2) With a CTAS, insert the data into an ORC table, using the str_to_map() UDF to transform the string dynamic column into a map. This step would also enable you to have your data in a more performant backend.
Created 11-13-2015 03:06 AM
@Scott Shaw, @Sourygna Luangsay
I created a "minimum-viable-serde" implementing what you described. See if it is what you need.
PS: I'm assuming your last column will be a map<string,string>, I haven't done data type handling for last column yet. For the key columns, it will respect the data type you declare when creating table.
from shell:
wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependenc... -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar echo "a,b,c,adsfa,adfa" > /tmp/testserde.txt echo "1,2,3,asdfasdf,sdfasd" >> /tmp/testserde.txt echo "4,5,6,adfas,adf,d" >> /tmp/testserde.txt hadoop fs -mkdir /tmp/testserde/ hadoop fs -put -f /tmp/testserde.txt /tmp/testserde/ hive
from hive:
add jar /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar; drop table testserde; create external table testserde ( field1 string, field2 int, field3 double, maps map<string,string> ) ROW FORMAT SERDE 'com.github.gbraccialli.hive.serde.NKeys_MapKeyValue' WITH SERDEPROPERTIES ( "delimiter" = "," ) LOCATION '/tmp/testserde/'; select * from testserde;
Source code is here:
https://github.com/gbraccialli/HiveUtils
PS2: there are lots of TODO yet.
Created 11-13-2015 01:47 PM
Thanks!! This looks great. We'll give it a try.
Created 12-01-2015 10:48 PM
It also works with spark-sql.
Created 03-13-2018 08:54 AM
I have a log file in which i have last field as key value pair.
for example:
2017-11-29 16:19:39,217 DEBUG [pool-4-thread-4] OutBound Msg From Engine -> |9=76|35=p|a=b|c=hg|
2017-11-29 16:20:29,217 DEBUG [pool-4-thread-4] OutBound Msg From Engine -> |3=6|35=w|a=b|
how to analyse this? Can we use your custom serde for this?
Because Regex serde is not supporting complex data types.
 
					
				
				
			
		
