Unable to extract status code from Web Server Logs in Nifi


Hi all,

I am using NiFi to extract attributes such as the IP address, timestamp, request type, and status code from web server logs. Here is a sample of my data:

133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566

I am using a regex in the ExtractText processor to do this. I can extract the IP, timestamp, and request type, but not the status code, which is 200 in this case. I am using (\\d{3}) right now, but it is not working. Has anyone tried this before?

1 ACCEPTED SOLUTION

Hi,

I'm assuming that you are using multiple capture groups to extract each piece of information. Can you explain what "it is not working" looks like in your situation? Is it capturing nothing, capturing different values than you expected, or throwing an exception? One possibility is that your expression is not focused enough: if (\d{3}) is the complete expression, it would capture "133" first (and then "199" and "040") before getting to "200". If you know the log format will remain consistent, you might want to try something like HTTP\/\d\.\d" (\d{3}), which ties the capture group to the protocol version that immediately precedes the status code. Please let us know if you have any more information and whether this solves your problem.
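
To make the point about focus concrete, here is a minimal sketch in Python. ExtractText uses Java regular expressions, but as far as I know these two patterns behave the same way in Python's re module. The variable names are only illustrative and are not part of any NiFi configuration.

```python
import re

# The sample log line from the question.
line = ('133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] '
        '"GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566')

# Unfocused: any run of three digits qualifies, so the first hit is in the IP.
loose = re.compile(r'(\d{3})')
first_loose = loose.search(line).group(1)
all_loose = loose.findall(line)

# Focused: the capture group must follow the protocol version and closing quote.
focused = re.compile(r'HTTP\/\d\.\d" (\d{3})')
status = focused.search(line).group(1)

print(first_loose)  # 133
print(all_loose)    # ['133', '199', '040', '200', '105']
print(status)       # 200
```

The unfocused pattern does reach "200", but only as its fourth match, which is why the surrounding context is needed to pick out the status code reliably.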

Update: I tested this expression and was able to get the following output:

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
	Value: 'Mon Oct 10 12:18:27 PDT 2016'
Key: 'lineageStartDate'
	Value: 'Mon Oct 10 12:18:27 PDT 2016'
Key: 'fileSize'
	Value: '115'
FlowFile Attribute Map Content
Key: 'HTTP response'
	Value: '200'
Key: 'HTTP response.0'
	Value: 'HTTP/1.0" 200'
Key: 'HTTP response.1'
	Value: '200'
Key: 'filename'
	Value: '787130965602970'
Key: 'path'
	Value: './'
Key: 'uuid'
	Value: 'ccb6f333-de33-4037-9a1a-aa9ce7f2ef32'
--------------------------------------------------
133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566

I uploaded the template I used here: ExtractText Regex Template.
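
For anyone reading the attribute map in that output, the naming can be sketched roughly as follows. This is only an imitation of the names seen in the dump (the property name itself, plus ".0" for the whole match and ".1" for the first capture group), not NiFi's actual implementation. The helper extract_attributes is hypothetical.

```python
import re

def extract_attributes(prop_name, pattern, text):
    """Hypothetical helper that imitates the attribute names seen in the output."""
    match = re.search(pattern, text)
    if match is None:
        return {}  # no match, no attributes added
    attrs = {f"{prop_name}.0": match.group(0)}  # the entire matched text
    for i, value in enumerate(match.groups(), start=1):
        attrs[f"{prop_name}.{i}"] = value       # one entry per capture group
    if match.groups():
        attrs[prop_name] = match.group(1)       # bare name mirrors group 1
    return attrs

line = ('133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] '
        '"GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566')

attrs = extract_attributes('HTTP response', r'HTTP\/\d\.\d" (\d{3})', line)
for key in sorted(attrs):
    print(key, '=>', attrs[key])
```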

2 REPLIES 2


Thank you so much @Andy LoPresto, it worked. It was capturing nothing earlier, perhaps because of the other three-digit numbers. The log format is consistent throughout the file, so yes, the workflow flowed like water 🙂
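
One other possible explanation for "capturing nothing", offered as an assumption rather than a diagnosis: if the expression was entered into the processor exactly as written in the question, with two backslashes, the regex engine reads it as a literal backslash followed by three letter d's, which this log line never contains. The sketch below shows the difference in Python, where a raw string passes the backslashes through to the regex engine unchanged.

```python
import re

line = ('133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] '
        '"GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566')

# Two backslashes: an escaped (literal) backslash, then the letter d three times.
doubled = re.search(r'(\\d{3})', line)

# One backslash: the digit class, which matches but lands on the IP address.
single = re.search(r'(\d{3})', line)

print(doubled)          # None
print(single.group(1))  # 133
```

The doubled form is what you would write inside a Java string literal, where the compiler consumes one level of escaping. That is an easy habit to carry over into a field that takes the pattern as plain text.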