My use case is to build custom logs based on NiFi's log output. I would like to:
1) split the log files into multiple flowfiles, one per log record
2) reconstruct log files based on some common pattern in the log records (e.g. ERROR/INFO level, processor name, etc.)
For #1 I run into a problem, as the logs have this form:
2018-01-08 13:43:55,215 ERROR bla bla bla bla bla bla bla ******************SPLIT HERE************** 2018-01-08 13:43:55,215 ERROR bla bla bla bla bla bla bla bla bla bla something else bla bla bla ******************SPLIT HERE************** 2018-01-08 13:43:55,215 ERROR bla bla bla bla ******************SPLIT HERE**************
Each new log record starts with date/time/level information, so I could use that to split the big log file into separate flowfiles, one per record, but the `SplitText` processor accepts only a number of lines, not a regex pattern. `ExtractText` could be another option, but then it is unclear how the different regex groups should be configured to capture only the necessary text.
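For reference, the kind of split I have in mind can be sketched in Python with a zero-width lookahead regex (a minimal sketch; the timestamp pattern is my assumption based on NiFi's default log format):

```python
import re

# Each record starts with "YYYY-MM-DD HH:MM:SS,mmm LEVEL ..." (assumed format)
RECORD_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} )")

log_text = (
    "2018-01-08 13:43:55,215 ERROR first record\n"
    "java.lang.Exception: stack trace line kept with its record\n"
    "2018-01-08 13:43:56,002 INFO second record\n"
)

# Splitting on the lookahead keeps the timestamp inside each record;
# lines that don't start with a timestamp stay attached to the record above.
records = [r for r in RECORD_START.split(log_text) if r]
for r in records:
    print(repr(r))
```

This is the behavior I would want from a regex-aware split processor: stack-trace lines remain with the record they belong to.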
Any ideas for this or another approach?
There is no guarantee with logs like NiFi's that each log "event" is a single log line.
Some log events are followed by multiple stack-trace lines that I am sure you want to keep in the same FlowFile as the original log line.
The SplitContent processor can split an incoming FlowFile based on a text sequence. While it could be used here, the sequence property does not support NiFi Expression Language, so it would not work well when the year rolls over.
It may be a valid Jira enhancement request to add EL support to this property in the SplitContent processor, so the text sequence could be defined dynamically per incoming FlowFile.
After you split the content, you could use an ExtractText processor to create a new attribute (for example: loglevel) set to DEBUG, INFO, WARN, or ERROR. The FlowFiles could then be routed to a MergeContent processor that bins FlowFiles on that attribute by setting it in the "Correlation Attribute Name" property.
This will recombine log records of the same loglevel.
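The loglevel extraction could use a regex along these lines (sketched in Python; the exact pattern and the `loglevel` attribute name are assumptions, configure them to match your layout):

```python
import re

# Capture the log level that follows the two timestamp tokens
# ("YYYY-MM-DD" and "HH:MM:SS,mmm") at the start of each record.
LEVEL_PATTERN = re.compile(r"^\S+ \S+ (DEBUG|INFO|WARN|ERROR)")

record = "2018-01-08 13:43:55,215 ERROR bla bla bla"
match = LEVEL_PATTERN.match(record)
loglevel = match.group(1) if match else None
print(loglevel)  # ERROR
```

In ExtractText you would add a dynamic property named `loglevel` with this pattern; group 1 becomes the attribute value that MergeContent then correlates on.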
@Matt Clarke That is exactly why I want to split the logs based on a regex, so that stack-trace lines are handled properly. I think I'll go with a fixed sequence for now, looking for `2018-`, and hope that by next year something has changed!
I could also use ExtractText to group logs like ( ( (*date*)(*log*) )(*date*) ) and then keep only the ( (*date*)(*log*) ) part as the flowfile content, but I don't know whether ExtractText can produce multiple flowfiles this way, one per match found in the initial log file, or whether I would have to save the matches as attributes and then construct new flowfiles from those values.
Did you manage to solve your problem? I have exactly the same kind of need and can't find a solution. Also, I don't have a fixed sequence (like your '2018'), so I need to split into several flowfiles based on a regex, but I don't know how to do that.
I had about the same need as your #1 ("split the logfiles into multiple flowfiles, one per every log record in the loggings") and didn't see at first how to proceed, because I couldn't find a processor that does this directly, but I believe I've found a workaround with 2 processors:
* First, use a ReplaceText processor to capture your entire log pattern with a Regex Replace, and insert a delimiter at the end (or beginning) of each log record (a special character or string that you are sure not to find in the log content).
* Then add a SplitContent processor to split the flowfile on your delimiter (set 'Use Byte Sequence Format' = Text and 'Byte Sequence' = your delimiter). Make sure 'Keep Byte Sequence' is set to false, so that once split your delimiter is removed. The 'Byte Sequence Location' value depends on whether you added the delimiter at the beginning or end of each record.
Now your log flowfile will be split into separate flowfiles, one per record. Hope this helps.
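The two steps above can be sketched in Python (a sketch only; the `##SPLIT##` delimiter and timestamp pattern are my assumptions, choose whatever cannot occur in your log content):

```python
import re

DELIM = "##SPLIT##"  # assumed never to appear in the log content
record_start = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} )")

log_text = (
    "2018-01-08 13:43:55,215 ERROR first record "
    "2018-01-08 13:43:56,002 INFO second record"
)

# Step 1 (ReplaceText): prepend the delimiter to each record start,
# keeping the matched timestamp via the backreference.
tagged = record_start.sub(DELIM + r"\1", log_text)

# Step 2 (SplitContent): split on the delimiter and drop it
# (the equivalent of 'Keep Byte Sequence' = false).
parts = [p for p in tagged.split(DELIM) if p]
```

The key point is that the ReplaceText step turns a regex boundary into a fixed byte sequence, which SplitContent can then handle.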