Support Questions
Find answers, ask questions, and share your expertise

how to create an hive table based on a regex processing multiples lines

how to create an hive table based on a regex processing multiples lines

Explorer

Hello,

I would like to create an hive table which take as input something like this:


...
-----------------------------------------------------------------------------------------------------

USER INFO: mickey.zak.nl,mickey Zak,NL,ADMIN
TIME : 15/08/2018 14:26:54
OPERATION : WOA APRROVED ,DATABASE psyche
DATA RECEIVED :
-----------------------------------------------------------------------------------------------------

USER INFO: mickey.zak.nl,mickey Zak,NL,ADMIN
TIME : 15/08/2018 14:26:54
OPERATION : WOA APRROVED ,DATABASE psyche
DATA RECEIVED :
-----------------------------------------------------------------------------------------------------
...

and have 3 fields "USER INFO","TIME","OPERATION".

So my regex have to process a string beginnig with a line of "-" and ending with a line of "-"

I try lot of regex which work fine with java regex, python regex or ruby regex but none of them works with hive, for sample I tried this:

CREATE EXTERNAL TABLE `auditpsyche_regex`(  `userinfo` string COMMENT '',  `time` string COMMENT '',  `operation` string COMMENT '')ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'WITH SERDEPROPERTIES ("input.regex"="(?:.*\\n)?.*INFO: (.*)\\n.*TIME : (../../.... ..:..:..)\\n.*OPERATION : (.*)(?:.*\\s+)*?-----------------------------------------------------------------------------------------------------")LOCATION '/appli/psyche/log/auditpsyche';

But it doesn't work, I can find regex which works well line by line but not mutliline by multiline.

I also tried a very simple regex like 'USER INFO : (.*)\n(.*)' to create a table with two field just to see if it can take multi line but not.

How can I manage to create my hive table?

2 REPLIES 2

Re: how to create an hive table based on a regex processing multiples lines

Super Collaborator

I think you will not be able to do so with SerDe. The SerDe will read record by record, which is line by line, and then the regex is applied on this record that has been read, making it impossible to span the pattern over multiple lines.

One way you could solve it, is try to create a table with SerDe with 1 line for each record and do the combination of multiple lines then via query in Hive.

Another way would be to process the input file first outside Hive and write out the line combined in one line as needed for your table.

Re: how to create an hive table based on a regex processing multiples lines

Explorer

Ok, thanks