
Hive Regex Serde for Multiple Line



Is there a way to load the data below, where a single record can span multiple rows, into a Hive table using the Regex SerDe?

 

123,1,hello world,LINEEND
124,0,good luck,LINEEND
with your new
job,LINEEND
125,1,thanks,LINEEND


The last field of the second record spans 3 rows. I used the following regex with the SerDe, but had no luck:

 

([\\d\\w]*),([\\d\\w]*),([\\S\\s \\n\\t]*),(LINEEND)

 

According to http://myregexp.com/, this regex parses the problematic second record correctly.


Re: Hive Regex Serde for Multiple Line

Currently, Hive does not support embedded newlines in text-formatted data, even via its OpenCSV implementation. This is noted at: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

The RegEx does not work because it is applied on top of the record reader, which hands the RegEx only a single line of input at a time (the data has already been split into lines in an upper layer).
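
To make this concrete, here is a minimal Python sketch (not Hive code) that mimics a line-oriented reader: the input is split into lines first, and only then is the pattern from the question applied to each line. It rests on a few assumptions: Python's `re` module stands in for the Java regex engine, whole-line matching stands in for how the SerDe applies the pattern, and the `match_per_line` helper is a hypothetical name used only for illustration.

```python
import re

# Sample data from the question; the second record spans three physical lines.
DATA = (
    "123,1,hello world,LINEEND\n"
    "124,0,good luck,LINEEND\n"
    "with your new\n"
    "job,LINEEND\n"
    "125,1,thanks,LINEEND\n"
)

# The pattern from the question, with the doubled backslashes removed.
PATTERN = re.compile(r"([\d\w]*),([\d\w]*),([\S\s \n\t]*),(LINEEND)")


def match_per_line(text):
    """Hypothetical helper: split into lines first, then match each line alone."""
    return [PATTERN.fullmatch(line) for line in text.splitlines()]


results = match_per_line(DATA)
for line, m in zip(DATA.splitlines(), results):
    # Continuation lines never reach the pattern together with their first line.
    print(repr(line), "->", m.groups() if m else None)
```

In this sketch the lines `with your new` and `job,LINEEND` do not match at all, and the third field of record 124 stops at `good luck`. The `\n` inside the character class never gets a chance to match, because no single input line contains a newline.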

We recommend using a non-text format such as SequenceFile, Avro or Parquet to store such data instead, as they don't suffer from such limitations.
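
Getting existing text data into such a format usually means reassembling the logical records outside Hive first. The following is a rough Python sketch of that pre-processing step, under stated assumptions: a new record is assumed to start on any line beginning with `<digits>,<digits>,`, every other line is treated as a continuation of the previous record, and the `reassemble` helper is hypothetical. Writing the resulting tuples to SequenceFile, Avro, or Parquet would need a separate library and is not shown.

```python
import re

# Assumption: a new record starts on a line beginning with "<digits>,<digits>,".
RECORD_START = re.compile(r"\d+,\d+,")
# Fields of one reassembled record; DOTALL lets the text field contain newlines.
RECORD = re.compile(r"(\d+),(\d+),(.*),LINEEND", re.DOTALL)


def reassemble(lines):
    """Hypothetical helper: group physical lines into records, then split fields."""
    chunks = []
    for line in lines:
        if RECORD_START.match(line) or not chunks:
            chunks.append(line)
        else:
            chunks[-1] += "\n" + line  # continuation of the previous record
    records = []
    for chunk in chunks:
        m = RECORD.fullmatch(chunk)
        if m is None:
            raise ValueError(f"unparseable record: {chunk!r}")
        records.append(m.groups())
    return records


# The sample rows exactly as posted in the question.
sample = [
    "123,1,hello world,LINEEND",
    "124,0,good luck,LINEEND",
    "with your new",
    "job,LINEEND",
    "125,1,thanks,LINEEND",
]
records = reassemble(sample)
```

With the sample as posted, this yields three records. The text field of record 124 keeps everything between its second comma and the final `,LINEEND`, including the embedded newlines, which is the same span the greedy third group of the regex in the question would capture.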