Posts: 33
Registered: ‎02-15-2016

Multi line regex for Hive Regex Serde

I have been trying to put together a regex to create a Hive external table on a log (text) file, which has some log entries spanning across various lines; but I haven't been successful yet. 


Does anyone have any experience with such a requirement? Does Hive Regex Serde support multiple lines?


Here's the sample text from the log file I am working with:


[2016-12-31T14:30:03.917+00:00] [SomeText] [NOTIFICATION:1] [] [] [ecid: 005HFKbQ7Z_3zyZ8DyW0001DO000CH3,0:1:2:2:1:6] [tid: 514cd700] [13013] Init block, 'Sample Tables', has more variables than the query select list.
[2016-12-31T14:30:03.918+00:00] [SomeText] [NOTIFICATION:1] [] [] [ecid: 005HFKbPIy83zZ8DyW0001DO000CGz,0:1:2:2:1:6] [tid: 50875700] [13013] Init block, 'Sample Tables', has more variables than the query select list.
[2016-12-31T14:56:18.467+00:00] [SomeText] [NOTIFICATION:1] [] [] [ecid: 005HFM4dHTk3z8DyW0001A60020wY,0] [tid: 5fe23700] [59118] Operation Purge Query Plan Cache succeeded!
[2016-12-31T15:00:01.46+00:00] [SomeText] [ERROR:1] [] [] [ecid: 005HFMIa1ln3yZ8DyW0001A600211s,0:6] [tid: 64df7700] [nQSError: 17014] Could not connect to Oracle database. [[
file: server/Query/Execution/DbGateway/Oci8i/../Oci8/Src/SQXOci8.cpp; line: 373
Properties: description=RpScopeVar Exchange; producerID=1405455368; requestID=429096; sessionID=4236224; userName=User;
[nQSError: 17001] Oracle Error code: 12541, message: ORA-12541: TNS:no listener
at OCI call OCILogon.

********** Task: 1. Running for (mls): 82 **********
Description: DB Connect
DSN: TEXT; userName=USER

********** Task: 2. Running for (mls): 82 **********
Description: RpScopeVar GatewayDbGateway Prepare
DSN:Forecasting OLTP Connection Pool;userName:User
SQL:select max(DT) from APPS.TBL where TS is not null and DT <= SYSDATE

********** Task: 3. Running for (mls): 82 **********
Description: Producer::ExecuteOnce()
Activity type:Producer::ExecuteOnce()
Repository Name::Star Subject Area Name:: User Name::User
Logical Hash of SQL:: 0x0




The goal is to write a regex and create Hive table on top of this file. The highlighted text should either be in a single field or broken into two fields - one containing the text before "[[" and 2nd containing everything between "[[" and "]]".


Any help would be greatly appreciated.

Cloudera Employee
Posts: 776
Registered: ‎03-23-2015

Re: Multi line regex for Hive Regex Serde



This is currently not supported as mentioned in the related thread:


As Harsh mentioned, this is due to the fact that it is applied on top of the record reader which has the limitation of reading one line of data. Please follow Harsh's recommendation.