Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Create a hive table upon a fixed-width log file but the last column width is not fixed

avatar
Expert Contributor

Basically, I am trying to analyze SQL server log files using Hive.

The layout is the log file is

  1. char(23) - for timestamp
  2. char(1) - space
  3. char(12) - source
  4. the rest of the row, and the length varies.

The row delimiter is CR+LF.

Below are some entries in the log.

2019-05-28 07:29:55.03 Server      UTC adjustment: -7:00 
2019-05-28 07:29:55.03 Server      (c) Microsoft Corporation. 
2019-05-28 07:29:55.03 Server      All rights reserved. 
2019-05-28 07:29:55.03 Server      Server process ID is 3368.

There are several posts here regarding fixed-width column layout. But in my case, the last column is identified not by the width but by the row delimiter.


1 ACCEPTED SOLUTION

avatar
Master Guru

@Haijin Li

Use hive Regex serde and your matching regex will be

(.{22})(.{1})(.{12})(.*)

(.{22}) -> 1st capture group for 22 characters

(.{1}) -> 2nd for 1 character

(.{12}) -> 3rd for 12 characters

(.*) -> 4th capture group matches for rest of the row.


View solution in original post

4 REPLIES 4

avatar
Master Guru

@Haijin Li

Use hive Regex serde and your matching regex will be

(.{22})(.{1})(.{12})(.*)

(.{22}) -> 1st capture group for 22 characters

(.{1}) -> 2nd for 1 character

(.{12}) -> 3rd for 12 characters

(.*) -> 4th capture group matches for rest of the row.


avatar
Expert Contributor

@Shu

Thanks for your help. I am currently on the road and will test your solution once I am back in office.

avatar
Expert Contributor

@Shu

Your code put me on the right track. Thanks again.

However, I got some strange returned records. Here are my guesses of the possible causes.

1. The CR and LF as the row delimiter and they are not the default row delimiter of the regex SerDe. How can I specify the row delimiter?

2. There are two strange characters in the first column of the first row. It might be related to the row delimiter too.

Below is the create-table script.

Create External Table slog(LogTime string, LogSource string, LogMessage string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(.{46})(.{24})(.*)") LOCATION '/path/to/slog/';

I attached a screenshot of the log file (in notepad ++) and the hive query result.

109151-captureserde.png


avatar
Master Guru

@Haijin Li

Try with this and this approaches and also to test out the regex serde functionality, Create new file using vi editor in shell and move it to HDFS directory and create table on top of this directory.