Created 10-10-2016 05:00 PM
Hi all,
I am using NiFi to extract attributes such as IP, timestamp, request type, and status code from web server logs. Here is a sample of my data:
133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566
I am using a regex in the ExtractText processor to do this. I can extract the IP, timestamp, and request type, but not the status code, which is 200 in this case. I am currently using (\\d{3}), but it is not working. Has anyone tried this before?
Created 10-10-2016 07:11 PM
Hi,
I'm assuming that you are using multiple capture groups to extract each piece of information. Can you explain what "it is not working" looks like in your situation? Is it capturing nothing, capturing different values than you expected, or throwing an exception? One possibility is that your expression is not focused enough -- if that is the complete expression, it would capture "133" first (as well as "199" and "040" before getting to "200"). If you know the log format will remain consistent, you might want to try something like
HTTP\/\d\.\d" (\d{3})
Please let us know if you have any more information and whether this solves your problem.
Update: I tested this expression and was able to get the following output:
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Mon Oct 10 12:18:27 PDT 2016'
Key: 'lineageStartDate' Value: 'Mon Oct 10 12:18:27 PDT 2016'
Key: 'fileSize' Value: '115'
FlowFile Attribute Map Content
Key: 'HTTP response' Value: '200'
Key: 'HTTP response.0' Value: 'HTTP/1.0" 200'
Key: 'HTTP response.1' Value: '200'
Key: 'filename' Value: '787130965602970'
Key: 'path' Value: './'
Key: 'uuid' Value: 'ccb6f333-de33-4037-9a1a-aa9ce7f2ef32'
--------------------------------------------------
133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566
I uploaded the template I used here: ExtractText Regex Template.
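For anyone who wants to check the behavior outside NiFi, here is a minimal sketch in Python. ExtractText uses Java regular expressions, but the constructs used here behave the same way in Python's re module for this input. The variable names are only illustrative. The second pattern assumes the double backslash in the question was typed literally into the property, which is a guess; if so, the regex engine reads it as a literal backslash followed by three "d" characters, which would explain capturing nothing.

```python
import re

# Sample log line from the question.
line = ('133.43.96.45 - - [01/Aug/1995:00:00:16 -0400] '
        '"GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0" 200 10566')

# Unanchored pattern: matches every run of three digits, left to right,
# so the status code is only one of several hits.
loose = re.findall(r"(\d{3})", line)

# Double-backslash variant (assumption: entered literally). This looks for
# a literal backslash followed by "ddd", which never occurs in the line.
doubled = re.findall(r"(\\d{3})", line)

# Anchored pattern from the answer: only the three digits that follow the
# HTTP version and the closing quote are captured.
anchored = re.search(r'HTTP\/\d\.\d" (\d{3})', line)
status = anchored.group(1) if anchored else None

print(loose)    # ['133', '199', '040', '200', '105']
print(doubled)  # []
print(status)   # 200
```

The whole match and the first capture group here correspond to the 'HTTP response.0' and 'HTTP response.1' attributes in the output above.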
Created 10-10-2016 09:16 PM
Thank you so much @Andy LoPresto, it worked. It was capturing nothing earlier, perhaps because of the other three-digit numbers. The log format is consistent throughout the file, so yeah, the workflow flowed like water 🙂