Created 07-22-2024 06:33 AM
We have a requirement to extract data from unstructured log data and capture the results in a custom format. We implemented a solution using the ExtractText and ReplaceText processors, but we are not getting the expected results.
Please find the details of the input data, current implementation, current output, and expected output format below, and help us with your expertise to improve this using the NiFi processors.
Input File Content :
<13>Jul 18 11:39:11 test234104.test.gmail.ae AgentDevice=WindowsLog AgentLogFile=Security PluginVersion=100.3.1.22 Source=Microsoft-Windows-Security-Auditing Computer=mycomputer1 OriginatingComputer=102.123.33.1 User= Domain= EventID=4688 EventIDCode=4688 EventType=8 EventCategory=13312
{122}Aug 18 11:39:11 test234104.test.gmail.ae PluginVersion=200.3.1.22 Source=Microsoft-Windows-Security-Auditing AgentDevice=WindowsLog AgentLogFile=Security Computer=mycomputer2.gmail.com OriginatingComputer=125.123.33.1 EventID=4688_2 EventIDCode=4688_2 EventType=8_2 EventCategory=13312_2
Current Implementation :
We used regex within the "ExtractText" processor to extract the data of 3 attributes:
Computer = Computer=(.+?)(\t)
event_type = EventType=(.+?)(\t)
eventID = EventID=(.+?)(\t)
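One likely problem with the patterns above: they require a tab (`\t`) after each value, but the sample records are space-separated. A minimal Python sketch (my own illustration, not the NiFi processor itself) showing a whitespace-tolerant alternative such as `\bComputer=(\S+)`:

```python
import re

# One record from the sample input (fields separated by spaces, not tabs)
line = ("<13>Jul 18 11:39:11 test234104.test.gmail.ae AgentDevice=WindowsLog "
        "AgentLogFile=Security Computer=mycomputer1 OriginatingComputer=102.123.33.1 "
        "EventID=4688 EventIDCode=4688 EventType=8 EventCategory=13312")

# \S+ stops at the next whitespace, so no trailing tab is required; the leading \b
# keeps "Computer=" from also matching inside "OriginatingComputer="
computer   = re.search(r"\bComputer=(\S+)", line).group(1)
event_id   = re.search(r"\bEventID=(\S+)", line).group(1)
event_type = re.search(r"\bEventType=(\S+)", line).group(1)

print(computer, event_id, event_type)  # mycomputer1 4688 8
```

The same `\bKey=(\S+)` style patterns can be pasted into ExtractText properties, since it uses Java regex with very similar syntax.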
We used "ReplaceText" to construct the custom output format:
computer = ${Computer}
event_id = ${eventID}
event_category = ${event_category}
event_type = ${event_type}
Current Output :
computer = mycomputer1
event_id = 4688
event_category = 1
event_type = 8
computer = mycomputer1
event_id = 4688
event_type = 8
computer = mycomputer1
event_id = 4688
event_type = 8
Expected Output Format :
computer = mycomputer1
event_id = 4688
event_type = 8
computer = mycomputer2.gmail.com
event_id = 4688_2
event_type = 8_2
This solution works fine when the input is a single record. But when there are multiple records, data from the first record is repeated and extra records are added for the empty lines. Please help us with a generic solution that can handle any number of rows in the input file and extract data for multiple attributes using regex.
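The repetition happens because ExtractText matches only once against the whole multi-record FlowFile, so every output row reuses the first record's attributes. A small Python sketch of the behavior being asked for, assuming records are separated by one blank line and fields by whitespace (field and function names here are my own illustration):

```python
import re

def extract_records(content):
    """Split multi-record content on blank lines, then pull fields per record."""
    records = [r for r in content.split("\n\n") if r.strip()]  # drop empty chunks
    out = []
    for rec in records:
        # \b keeps "Computer=" from matching inside "OriginatingComputer="
        out.append({m.group(1): m.group(2)
                    for m in re.finditer(r"\b(Computer|EventID|EventType)=(\S+)", rec)})
    return out

sample = (
    "<13>Jul 18 11:39:11 host Computer=mycomputer1 OriginatingComputer=102.123.33.1 "
    "EventID=4688 EventIDCode=4688 EventType=8\n"
    "\n"
    "{122}Aug 18 11:39:11 host Computer=mycomputer2.gmail.com "
    "EventID=4688_2 EventIDCode=4688_2 EventType=8_2\n"
)

for rec in extract_records(sample):
    print(f"computer = {rec['Computer']}")
    print(f"event_id = {rec['EventID']}")
    print(f"event_type = {rec['EventType']}")
```

In NiFi terms, the split step corresponds to breaking the FlowFile into one record per FlowFile (or using a record-oriented reader) before the per-record extraction runs.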
Created 07-22-2024 10:23 AM
@NagendraKumar
ExtractText is only going to work with a well-defined content structure. So when you have an unknown number of records in a single FlowFile, you would do better to split that multi-record file into single-record FlowFiles, apply your ExtractText and ReplaceText dataflow against each of them, and then easily merge those split records back into one file using MergeContent with the Defragment option.
Since your files have an unknown number of records separated by a blank line, the SplitContent processor can easily be used to split the source FlowFile into individual record FlowFiles.
The "Byte Sequence" is simply two line returns.
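As an illustration, the SplitContent configuration described above might look like this (property names are from the standard SplitContent processor; the hex values assume Unix `\n` line endings):

```
SplitContent
  Byte Sequence Format : Hexadecimal
  Byte Sequence        : 0A0A        # two \n characters; \r\n endings would need 0D0A0D0A
  Keep Byte Sequence   : false
```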
After your ExtractText and ReplaceText processors, you can recombine all the splits to one FlowFile using MergeContent as below:
Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 07-23-2024 02:44 AM
@MattWho - Thanks a lot for your valuable input!
This solution should work, but I am concerned about the volume of records. We plan to receive 150,000,000 records (one hundred fifty million) in one day, so splitting and merging that many records might be a costly operation. Is there any other alternative way that we can explore?
Created 07-23-2024 05:44 AM
@NagendraKumar
You might want to try the QueryRecord processor or the ScriptedTransformRecord processor. Since your data is unstructured, you could try using the GrokReader and FreeFormTextRecordSetWriter.
I agree that splitting and merging is not ideal with so many FlowFiles. ExtractText loads FlowFile content into memory in order to parse it for extracting bits (high heap usage). MergeContent loads FlowFile metadata (FlowFile attributes) into heap memory for all FlowFiles allocated to merge bins (high heap usage, which can be managed via multiple MergeContent processors in series, each limiting the max bin FlowFile count).
Hope this helps give you some alternate direction.
Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 07-24-2024 07:02 AM
Thanks a lot @MattWho for your valuable comments. Please help with configuring the GrokReader for the below input data, as I am unable to find the right documentation for configuring the GrokReader with unstructured data.
<13>Jul 18 11:39:11 test234104.test.gmail.ae AgentDevice=WindowsLog AgentLogFile=Security PluginVersion=100.3.1.22 Source=Microsoft-Windows-Security-Auditing Computer=mycomputer1 OriginatingComputer=102.123.33.1 User= Domain= EventID=4688 EventIDCode=4688 EventType=8 EventCategory=13312
Created 07-26-2024 06:22 AM
@NagendraKumar
This is not something I have messed with much. The GrokReader is what would be commonly used to parse unstructured data. Your data looks similar to Cisco syslog structure. While the GrokReader has a built-in pattern file, you may find yourself needing to define a custom pattern file for your specific data. You might find this other community post helpful:
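As a rough, untested sketch of what a custom pattern file entry might look like for the first sample record (the `WINLOG` name and the field names are my own; `NONNEGINT`, `SYSLOGTIMESTAMP`, `HOSTNAME`, and `GREEDYDATA` are standard Grok base patterns; the second record's `{122}` prefix would need its own variant):

```
WINLOG <%{NONNEGINT:priority}>%{SYSLOGTIMESTAMP:timestamp} %{HOSTNAME:host} %{GREEDYDATA:kvpairs}
```

The `kvpairs` field would still hold the raw `Key=Value` tail, which a downstream processor would need to break apart.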
https://community.cloudera.com/t5/Support-Questions/ExtractGrok-processor-Writing-Regex-to-parse-Cis...
Hopefully you can use the pattern file example provided through the GitHub link from that other community thread to help create a custom pattern file that works for your specific data:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-proce...
I hope this information helps you with your use case.
Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt