Created 05-31-2019 02:39 PM
Basically, I am trying to analyze SQL server log files using Hive.
The layout is the log file is
The row delimiter is CR+LF.
Below are some entries in the log.
2019-05-28 07:29:55.03 Server UTC adjustment: -7:00 2019-05-28 07:29:55.03 Server (c) Microsoft Corporation. 2019-05-28 07:29:55.03 Server All rights reserved. 2019-05-28 07:29:55.03 Server Server process ID is 3368.
There are several posts here regarding fixed-width column layout. But in my case, the last column is identified not by the width but by the row delimiter.
Created 06-01-2019 03:57 PM
Use hive Regex serde and your matching regex will be
(.{22})(.{1})(.{12})(.*)
(.{22}) -> 1st capture group for 22 characters
(.{1}) -> 2nd for 1 character
(.{12}) -> 3rd for 12 characters
(.*) -> 4th capture group matches for rest of the row.
Created 06-01-2019 03:57 PM
Use hive Regex serde and your matching regex will be
(.{22})(.{1})(.{12})(.*)
(.{22}) -> 1st capture group for 22 characters
(.{1}) -> 2nd for 1 character
(.{12}) -> 3rd for 12 characters
(.*) -> 4th capture group matches for rest of the row.
Created 06-02-2019 03:22 AM
Thanks for your help. I am currently on the road and will test your solution once I am back in office.
Created on 06-03-2019 04:14 AM - edited 08-17-2019 03:14 PM
Your code put me on the right track. Thanks again.
However, I got some strange returned records. Here are my guesses of the possible causes.
1. The CR and LF as the row delimiter and they are not the default row delimiter of the regex SerDe. How can I specify the row delimiter?
2. There are two strange characters in the first column of the first row. It might be related to the row delimiter too.
Below is the create-table script.
Create External Table slog(LogTime string, LogSource string, LogMessage string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(.{46})(.{24})(.*)") LOCATION '/path/to/slog/';
I attached a screenshot of the log file (in notepad ++) and the hive query result.
Created 06-04-2019 03:04 AM