Contributor
Posts: 59
Registered: ‎03-31-2014

Hive Regex Serde for Multiple Line


Is there a way to load the data below, where a single record can span multiple rows, into a Hive table using the Regex SerDe?

 

123,1,hello world,LINEEND
124,0,good luck,LINEEND
with your new
job,LINEEND
125,1,thanks,LINEEND


The last field of the second record spans 3 rows. I used the following regex with the SerDe, but it did not work:

 

([\\d\\w]*),([\\d\\w]*),([\\S\\s \\n\\t]*),(LINEEND)

 

According to http://myregexp.com/, this regex parses the problematic second record correctly.

Posts: 1,673
Kudos: 330
Solutions: 263
Registered: ‎07-31-2013

Re: Hive Regex Serde for Multiple Line

Currently, Hive does not recognize embedded newlines in text-formatted data, even via its OpenCSV implementation. This is noted at: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

The reason the regex does not work is that it is applied on top of the record reader. The record reader has already split the input on newlines in an upper layer, so the regex only ever receives a single line as input.
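
To illustrate, here is a rough Python sketch (not Hive code) of what happens. It assumes the record reader hands over one newline-delimited line at a time and that the SerDe requires the whole line to match the pattern. The names `PATTERN`, `SAMPLE`, and `match_per_line` are made up for this example, and the regex is the one from the question with the Hive-level double backslashes reduced to single ones.

```python
import re

# The pattern from the question, with the double backslashes unescaped.
PATTERN = re.compile(r"([\d\w]*),([\d\w]*),([\S\s \n\t]*),(LINEEND)")

# The sample data from the question, copied as-is.
SAMPLE = (
    "123,1,hello world,LINEEND\n"
    "124,0,good luck,LINEEND\n"
    "with your new\n"
    "job,LINEEND\n"
    "125,1,thanks,LINEEND\n"
)


def match_per_line(text):
    """Mimic a line-based record reader: split on newlines first,
    then try to match each line in isolation (assumed behaviour)."""
    results = []
    for line in text.splitlines():
        m = PATTERN.fullmatch(line)
        results.append(m.groups() if m else None)
    return results


# Each physical line is matched on its own, so the continuation
# lines never reach the regex together with the start of the record.
per_line = match_per_line(SAMPLE)
for line, groups in zip(SAMPLE.splitlines(), per_line):
    print(repr(line), "->", groups)

# In a regex tester the whole multi-line text is passed as ONE string,
# which is why the pattern appears to work there.
whole_record = PATTERN.fullmatch(
    "124,0,good luck,LINEEND\nwith your new\njob,LINEEND"
)
print(whole_record is not None)
```

In this sketch the lines `with your new` and `job,LINEEND` fail to match on their own, while the same text matches when passed as a single string. The newline handling inside the regex never comes into play, because the newlines are consumed before the regex runs.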

We recommend storing such data in a non-text format such as SequenceFile, Avro, or Parquet instead, as these formats do not suffer from this limitation.