- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Create a hive table upon a fixed-width log file but the last column width is not fixed
- Labels:
-
Apache Hive
Created ‎05-31-2019 02:39 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Basically, I am trying to analyze SQL server log files using Hive.
The layout is the log file is
- char(23) - for timestamp
- char(1) - space
- char(12) - source
- the rest of the row, and the length varies.
The row delimiter is CR+LF.
Below are some entries in the log.
2019-05-28 07:29:55.03 Server UTC adjustment: -7:00 2019-05-28 07:29:55.03 Server (c) Microsoft Corporation. 2019-05-28 07:29:55.03 Server All rights reserved. 2019-05-28 07:29:55.03 Server Server process ID is 3368.
There are several posts here regarding fixed-width column layout. But in my case, the last column is identified not by the width but by the row delimiter.
Created ‎06-01-2019 03:57 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Use hive Regex serde and your matching regex will be
(.{22})(.{1})(.{12})(.*)
(.{22}) -> 1st capture group for 22 characters
(.{1}) -> 2nd for 1 character
(.{12}) -> 3rd for 12 characters
(.*) -> 4th capture group matches for rest of the row.
Created ‎06-01-2019 03:57 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Use hive Regex serde and your matching regex will be
(.{22})(.{1})(.{12})(.*)
(.{22}) -> 1st capture group for 22 characters
(.{1}) -> 2nd for 1 character
(.{12}) -> 3rd for 12 characters
(.*) -> 4th capture group matches for rest of the row.
Created ‎06-02-2019 03:22 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for your help. I am currently on the road and will test your solution once I am back in office.
Created on ‎06-03-2019 04:14 AM - edited ‎08-17-2019 03:14 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Your code put me on the right track. Thanks again.
However, I got some strange returned records. Here are my guesses of the possible causes.
1. The CR and LF as the row delimiter and they are not the default row delimiter of the regex SerDe. How can I specify the row delimiter?
2. There are two strange characters in the first column of the first row. It might be related to the row delimiter too.
Below is the create-table script.
Create External Table slog(LogTime string, LogSource string, LogMessage string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(.{46})(.{24})(.*)") LOCATION '/path/to/slog/';
I attached a screenshot of the log file (in notepad ++) and the hive query result.
Created ‎06-04-2019 03:04 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
