Extracting data from unstructured logs text from multiple records

Rising Star

We have a requirement to extract data from unstructured log data and capture the results in a custom format. We have implemented a solution using the ExtractText and ReplaceText processors, but we are not getting the expected results.

Please find the details of the input data, current implementation, current output, and expected output format below, and help us with your expertise to improve this using the NiFi processors.

Input File Content:

<13>Jul 18 11:39:11 test234104.test.gmail.ae AgentDevice=WindowsLog    AgentLogFile=Security   PluginVersion=100.3.1.22  Source=Microsoft-Windows-Security-Auditing      Computer=mycomputer1 OriginatingComputer=102.123.33.1    User= Domain=     EventID=4688 EventIDCode=4688  EventType=8 EventCategory=13312

{122}Aug 18 11:39:11 test234104.test.gmail.ae  PluginVersion=200.3.1.22  Source=Microsoft-Windows-Security-Auditing AgentDevice=WindowsLog    AgentLogFile=Security     Computer=mycomputer2.gmail.com OriginatingComputer=125.123.33.1    EventID=4688_2 EventIDCode=4688_2  EventType=8_2 EventCategory=13312_2

Current Implementation:

[Screenshot: NiFi flow with the ExtractText and ReplaceText processors]

We have used regular expressions within the "ExtractText" processor to extract the data for 3 attributes:

[Screenshot: ExtractText processor configuration]


Computer = Computer=(.+?)(\t)
event_type = EventType=(.+?)(\t)
eventID = EventID=(.+?)(\t)
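
For reference, a minimal Python sketch of the same extraction (the tab separators and the simplified records are assumptions), which reproduces the repeated-first-record behaviour we describe below; like ExtractText's default, non-repeating capture behaviour, re.search keeps only the first match:

import re

# Two simplified records separated by a blank line; tabs are assumed
# as the field separator, matching the (\t) in the patterns above.
content = (
    "Computer=mycomputer1\tEventID=4688\tEventType=8\t\n"
    "\n"
    "Computer=mycomputer2.gmail.com\tEventID=4688_2\tEventType=8_2\t\n"
)

patterns = {
    "Computer":   r"Computer=(.+?)(\t)",
    "eventID":    r"EventID=(.+?)(\t)",
    "event_type": r"EventType=(.+?)(\t)",
}

# re.search returns only the FIRST match, so every attribute is taken
# from the first record no matter how many records the content holds.
for name, pattern in patterns.items():
    m = re.search(pattern, content)
    print(name, "=", m.group(1) if m else None)

# Prints:
#   Computer = mycomputer1
#   eventID = 4688
#   event_type = 8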

Used "ReplaceText" to construct the custom output format,

[Screenshot: ReplaceText processor configuration]

computer = ${Computer}
event_id = ${eventID}
event_category = ${event_category}
event_type = ${event_type}

Current Output:

computer = mycomputer1
event_id = 4688
event_category = 1
event_type = 8
computer = mycomputer1
event_id = 4688
event_type = 8
computer = mycomputer1
event_id = 4688
event_type = 8

Expected Output Format:

computer = mycomputer1
event_id = 4688
event_type = 8


computer = mycomputer2.gmail.com
event_id = 4688_2
event_type = 8_2

This solution works fine when the data is just a single record, but when there are multiple records, the data from the first record is repeated for every record and extra records are added for the empty lines. Please help us with a generic solution that can handle any number of records in the input file and extract data for multiple attributes using regex.


5 REPLIES

Master Mentor

@NagendraKumar 

ExtractText is only going to work with a well-defined content structure. So when you have an unknown number of records in a single FlowFile, you would be better off splitting that multi-record file into single-record FlowFiles, against which you can then apply your ExtractText and ReplaceText dataflow. You can then easily merge those split records back into one file using MergeContent with the Defragment merge strategy.
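
As a sketch, the overall flow would look something like this (GetFile and PutFile are just placeholders for whatever ingest and egress processors you already use):

GetFile -> SplitContent -> ExtractText -> ReplaceText -> MergeContent (Defragment) -> PutFile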

Since your files have an unknown number of records separated by a blank line, the SplitContent processor can easily be used to split the source FlowFile into individual record FlowFiles.

[Screenshot: SplitContent processor configuration]


The "Byte Sequence" is simply two line returns.
After your ExtractText and ReplaceText processors, you can recombine all the splits into one FlowFile using MergeContent, as below:

[Screenshot: MergeContent processor configuration]


Thank you,
Matt

avatar
Rising Star

@MattWho  - Thanks a lot for your valuable input!

This solution should work, but I am concerned about the volume of records. We plan to receive 150,000,000 records (one hundred fifty million) per day, so splitting and then merging that many records might be a costly operation. Is there any other alternative way that we can explore?
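
For context, assuming an even arrival rate, that is 150,000,000 records / 86,400 seconds ≈ 1,736 records per second, so splitting would create (and merging would have to track) on the order of 1,700+ FlowFiles every second, before any peak factor.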

Master Mentor

@NagendraKumar 

You might want to try using the QueryRecord processor or the ScriptedTransformRecord processor. Since your data is unstructured, you could try using the GrokReader and the FreeFormTextRecordSetWriter.
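
As a rough sketch, the FreeFormTextRecordSetWriter's "Text" property could then reproduce your desired layout from the record fields (computer, event_id, and event_type here are assumed field names; they must match whatever names your GrokReader expression assigns):

computer = ${computer}
event_id = ${event_id}
event_type = ${event_type}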

I agree that splitting and merging is not ideal with so many FlowFiles. ExtractText loads FlowFile content into memory in order to parse it and extract bits (high heap usage). MergeContent loads FlowFile metadata (FlowFile attributes) into heap memory for all FlowFiles allocated to merge bins (high heap usage, which can be managed by using multiple MergeContent processors in series and limiting the max bin FlowFile count).

Hope this helps give you some alternate direction.


Thank you,
Matt

Rising Star

Thanks a lot @MattWho for your valuable comments. Please help with configuring the GrokReader for the below input data, as I am unable to find the right documentation for configuring the GrokReader with unstructured data.

<13>Jul 18 11:39:11 test234104.test.gmail.ae AgentDevice=WindowsLog    AgentLogFile=Security   PluginVersion=100.3.1.22  Source=Microsoft-Windows-Security-Auditing      Computer=mycomputer1 OriginatingComputer=102.123.33.1    User= Domain=     EventID=4688 EventIDCode=4688  EventType=8 EventCategory=13312

Master Mentor

@NagendraKumar 

This is not something I have messed with much. The GrokReader is what would commonly be used to parse unstructured data. Your data looks similar to Cisco syslog structure. While the GrokReader has a built-in pattern file, you may find yourself needing to define a custom pattern file for your specific data. You might find this other community post helpful:
https://community.cloudera.com/t5/Support-Questions/ExtractGrok-processor-Writing-Regex-to-parse-Cis...

Hopefully you can use the pattern file example provided through the GitHub link from that other community thread to help create a custom pattern file that works for your specific data:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-proce...
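
As an untested starting point (the field names are only examples), a GrokReader "Grok Expression" for lines shaped like your first sample might begin with the syslog-style prefix and capture the remainder for further parsing:

<%{POSINT:priority}>%{SYSLOGTIMESTAMP:timestamp} %{HOSTNAME:host} %{GREEDYDATA:message}

Note that your second sample record starts with {122} rather than <13>, so the prefix pattern would need adjusting to cover both, and the key=value pairs left in message would still need additional patterns to become individual fields.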

Hope this information helps you with your use case.



Thank you,
Matt