I am trying to improve the way we build our NiFi flows by implementing Record processing.
Our current flow is: split text > get timestamps using regex > merge on 'correlation_id' (attribute derived from the timestamp, format yyyy-MM-dd-HH) > extract sourcetype using regex (the flow splits at this point for each sourcetype) > merge again on 'correlation_id' > out to HDFS/Splunk etc.
The flow is structured with one spine per sourcetype.
I recently watched this series of videos.
I was particularly interested in the first one, where Record Processing is used in place of split/merge/regex etc.
As a result, I have started to redesign the flow above. The logs come to us in CEF format from a syslogListener (TCP).
My progress so far, processor by processor:
GenerateFlowFile - using some data copied from the original flow.
SplitText - splits the text/plain content into single logs (I have not figured out a way to avoid splitting).
ParseCEF - parses the CEF into JSON format.
JoltTransformJSON - 'default' operation, to add some info to the JSON 'extension' content (feedname etc.).
JoltTransformJSON - 'shift' operation, to strip out unwanted lines, keeping only the lines I later need as attributes plus the 'raw_content' line. (I haven't figured out how to achieve these two steps in one Jolt processor. Is there a way?)
EvaluateJsonPath - pulls the above attributes from the content.
MergeRecord - currently merging on 'sourcetype'.
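To make the two Jolt steps above concrete, here is a rough Python sketch of what they do to a single parsed record. This is not NiFi or Jolt code: the helper names `apply_default` and `keep_fields`, the sample values, and the exact record layout (including where 'raw_content' sits) are assumptions for illustration only, and the 'default' here only fills in missing keys.

```python
import copy


def apply_default(record, defaults):
    """Add keys from `defaults` that are missing in `record` (simplified 'default' step)."""
    result = copy.deepcopy(record)
    for key, value in defaults.items():
        if key not in result:
            result[key] = copy.deepcopy(value)
        elif isinstance(value, dict) and isinstance(result[key], dict):
            # Recurse so nested defaults (e.g. inside 'extension') are merged in.
            result[key] = apply_default(result[key], value)
    return result


def keep_fields(record, paths):
    """Keep only the listed dotted paths, renamed to top-level keys (simplified 'shift' step)."""
    kept = {}
    for path, out_key in paths.items():
        node = record
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node is not None:
            kept[out_key] = node
    return kept


# Made-up record standing in for the ParseCEF output; the layout is assumed.
parsed = {
    "header": {"deviceVendor": "ExampleVendor", "severity": "5"},
    "extension": {"raw_content": "CEF:0|ExampleVendor|...", "unwanted": "x"},
}

# Step 1: add the extra info (feedname etc.) to 'extension'.
with_defaults = apply_default(parsed, {"extension": {"feedname": "example_feed"}})

# Step 2: drop everything except the fields needed later as attributes plus raw_content.
slim = keep_fields(
    with_defaults,
    {
        "header.deviceVendor": "sourcetype",
        "extension.feedname": "feedname",
        "extension.raw_content": "raw_content",
    },
)
print(slim)
```

What I am after is the same end result as `slim`, but produced by one JoltTransformJSON processor instead of two.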
That's as far as I have got. The flow uses far fewer processors and achieves (nearly) the same results. I have also compared the flow file lineage duration: my new flow takes around 5 seconds and the old style around 35 seconds (although it could be tweaked).
My question is: is there a way to merge on two attributes, 'sourcetype' and 'correlation_id'? In my case I don't want different sourcetypes ("deviceVendor" in the JSON) merged together in HDFS, and I also want to merge/group by timestamp (correlation_id) in one-hour bins.
I want to get away from having 11 different spines, one for each sourcetype. Is MergeRecord clever enough to group by common content and correlate on my time (correlation_id) attribute, or is this asking too much?
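To show the grouping I am after, here is a small Python sketch of the desired outcome (again not NiFi code). The function names `correlation_id` and `bin_records` and the sample records are made up; only the yyyy-MM-dd-HH format and the two grouping keys come from my flow.

```python
from collections import defaultdict
from datetime import datetime


def correlation_id(ts):
    """Format a timestamp as yyyy-MM-dd-HH, i.e. a one-hour bin."""
    return ts.strftime("%Y-%m-%d-%H")


def bin_records(records):
    """Group raw logs by the pair (sourcetype, correlation_id)."""
    bins = defaultdict(list)
    for rec in records:
        key = (rec["sourcetype"], correlation_id(rec["timestamp"]))
        bins[key].append(rec["raw_content"])
    return dict(bins)


# Made-up sample records for illustration only.
records = [
    {"sourcetype": "VendorA", "timestamp": datetime(2020, 1, 1, 10, 5), "raw_content": "a1"},
    {"sourcetype": "VendorA", "timestamp": datetime(2020, 1, 1, 10, 55), "raw_content": "a2"},
    {"sourcetype": "VendorA", "timestamp": datetime(2020, 1, 1, 11, 0), "raw_content": "a3"},
    {"sourcetype": "VendorB", "timestamp": datetime(2020, 1, 1, 10, 30), "raw_content": "b1"},
]

bins = bin_records(records)
for (sourcetype, hour), logs in sorted(bins.items()):
    print(sourcetype, hour, logs)
```

Each bin here corresponds to one merged output: logs from the same sourcetype and the same hour end up together, and nothing else is mixed in.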
I hope this makes sense. Any assistance would be greatly appreciated.