Posts: 59
Registered: ‎03-31-2014

Hive Regex Serde for Multiple Line

Is there a way to load the data below, where a single record spans multiple rows, into a Hive table using the Regex SerDe?


123,1,hello world,LINEEND
124,0,good luck,LINEEND
with your new

The last field of the second record spans 3 rows. I tried the following regex with the SerDe, but had no luck:


([\\d\\w]*),([\\d\\w]*),([\\S\\s \\n\\t]*),(LINEEND)


According to  this parses the second problem record correctly.

Posts: 1,903
Kudos: 435
Solutions: 305
Registered: ‎07-31-2013

Re: Hive Regex Serde for Multiple Line

Currently, Hive does not support embedded newlines in text-formatted data, even via its OpenCSV implementation. This is noted at:

The regex does not work because it is applied on top of the record reader, which splits the input at newlines in an upper layer. As a result, the regex only ever sees a single line of input at a time, never the whole multi-line record.
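To make the point concrete, here is a small Python sketch (not Hive code) that mimics the situation: the same pattern from the question is applied once per physical line, as a line-oriented reader would hand it over, and once to the complete logical record. The sample text in `RAW` and the helper names are made up for illustration, and Python's `re` module stands in for the Java regex engine, so treat it as an approximation of the behavior rather than what Hive does internally.

```python
import re

# Pattern from the question, with the doubled backslashes reduced to single ones.
PATTERN = re.compile(r"([\d\w]*),([\d\w]*),([\S\s \n\t]*),(LINEEND)")

# Hypothetical sample: the last field of the second record spans three lines.
RAW = "123,1,hello world,LINEEND\n124,0,good luck\nwith your new\njob,LINEEND\n"


def match_per_line(raw):
    """Mimic a line-oriented reader: split at newlines first, then match each line."""
    return [PATTERN.fullmatch(line) is not None for line in raw.splitlines()]


def match_whole_record(record):
    """Match a complete logical record, embedded newlines included."""
    m = PATTERN.fullmatch(record)
    return m.groups() if m else None


if __name__ == "__main__":
    # Only the single-line record matches when the input is split first.
    print(match_per_line(RAW))
    # The same pattern matches once it sees the whole multi-line record.
    print(match_whole_record("124,0,good luck\nwith your new\njob,LINEEND"))
```

In this sketch only the first line matches when the text is split beforehand, while the full second record matches when passed in one piece. That is consistent with the pattern itself being fine and the line splitting being the obstacle.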

We recommend using a non-text format such as SequenceFile, Avro, or Parquet to store such data instead, as these formats do not suffer from this limitation.
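Before the data can be written to one of those formats, the physical lines have to be regrouped into logical records. The following is a minimal Python sketch of that preprocessing step. It assumes every record ends with `,LINEEND` as in the question's sample. The function names `logical_records` and `to_fields` are hypothetical, and the actual write to SequenceFile, Avro, or Parquet is left out because it depends on the library you choose.

```python
def logical_records(lines, terminator="LINEEND"):
    """Yield one string per logical record, joining lines until the terminator."""
    buffer = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if buffer[-1].endswith("," + terminator):
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Trailing lines without a terminator are emitted rather than dropped.
        yield "\n".join(buffer)


def to_fields(record, terminator="LINEEND"):
    """Split a terminated record into (first, second, text) without the terminator."""
    suffix = "," + terminator
    if not record.endswith(suffix):
        raise ValueError("record is missing the terminator: %r" % record)
    body = record[: -len(suffix)]
    # Split on the first two commas only, so the text field may contain commas.
    first, second, text = body.split(",", 2)
    return first, second, text


if __name__ == "__main__":
    sample = [
        "123,1,hello world,LINEEND\n",
        "124,0,good luck\n",
        "with your new\n",
        "job,LINEEND\n",
    ]
    for rec in logical_records(sample):
        print(to_fields(rec))
```

Each tuple produced this way keeps the embedded newlines inside the third field, so it can be handed to a writer for a binary or columnar format without relying on newline-delimited rows.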