<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Extracting data from unstructured logs text from multiple records in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390849#M247358</link>
    <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/35454"&gt;@MattWho&lt;/a&gt;&amp;nbsp; - Thanks a lot for your valuable input!&lt;/P&gt;&lt;P&gt;This solution should work, but I am concerned about the volume of records. We expect to receive&amp;nbsp;150,000,000 records (one hundred fifty million) per day, so splitting that many records and merging them again might be a costly operation. Is there an alternative approach that we can explore?&lt;/P&gt;</description>
    <pubDate>Tue, 23 Jul 2024 09:44:34 GMT</pubDate>
    <dc:creator>NagendraKumar</dc:creator>
    <dc:date>2024-07-23T09:44:34Z</dc:date>
    <item>
      <title>Extracting data from unstructured logs text from multiple records</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390782#M247335</link>
      <description>&lt;P&gt;We have a requirement to extract data from unstructured log data and capture the results in a custom format. We have implemented the solution using the ExtractText and ReplaceText processors, but we are not getting the expected results.&lt;/P&gt;&lt;P&gt;Please find the details of the input data, current implementation, current output, and expected output format below, and help us improve this using NiFi processors.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Input File Content :&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;lt;13&amp;gt;Jul 18 11:39:11 test234104.test.gmail.ae AgentDevice=WindowsLog    AgentLogFile=Security   PluginVersion=100.3.1.22  Source=Microsoft-Windows-Security-Auditing      Computer=mycomputer1 OriginatingComputer=102.123.33.1    User= Domain=     EventID=4688 EventIDCode=4688  EventType=8 EventCategory=13312&lt;/P&gt;&lt;P&gt;{122}Aug 18 11:39:11 test234104.test.gmail.ae  PluginVersion=200.3.1.22  Source=Microsoft-Windows-Security-Auditing AgentDevice=WindowsLog    AgentLogFile=Security     Computer=mycomputer2.gmail.com OriginatingComputer=125.123.33.1    EventID=4688_2 EventIDCode=4688_2  EventType=8_2 EventCategory=13312_2&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Current Implementation :&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NagendraKumar_0-1721654612429.png" style="width: 400px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/41214i681BB544E5608E41/image-size/medium?v=v2&amp;amp;px=400" role="button" title="NagendraKumar_0-1721654612429.png" alt="NagendraKumar_0-1721654612429.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;We used regular expressions within the "ExtractText" processor to extract the data of 3 attributes:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NagendraKumar_1-1721654660453.png" style="width: 
400px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/41215i4D676206DD7CF4FA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="NagendraKumar_1-1721654660453.png" alt="NagendraKumar_1-1721654660453.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Computer =&amp;nbsp;Computer=(.+?)(\t)&lt;BR /&gt;event_type =&amp;nbsp;EventType=(.+?)(\t)&lt;BR /&gt;eventID =&amp;nbsp;EventID=(.+?)(\t)&lt;/P&gt;&lt;P&gt;Used "ReplaceText" to construct the custom output format,&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NagendraKumar_2-1721654819882.png" style="width: 400px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/41216i64012D03A66B7A85/image-size/medium?v=v2&amp;amp;px=400" role="button" title="NagendraKumar_2-1721654819882.png" alt="NagendraKumar_2-1721654819882.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;computer = ${Computer}&lt;BR /&gt;event_id = ${eventID}&lt;BR /&gt;event_category = ${event_category}&lt;BR /&gt;event_type = ${event_type}&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Current Output :&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;computer = mycomputer1&lt;BR /&gt;event_id = 4688&lt;BR /&gt;event_category = 1&lt;BR /&gt;event_type = 8&lt;BR /&gt;computer = mycomputer1&lt;BR /&gt;event_id = 4688&lt;BR /&gt;event_type = 8&lt;BR /&gt;computer = mycomputer1&lt;BR /&gt;event_id = 4688&lt;BR /&gt;event_type = 8&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Expected Output Format :&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;computer = mycomputer1&lt;BR /&gt;event_id = 4688&lt;BR /&gt;event_type = 8&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;computer = mycomputer2.gmail.com&lt;BR /&gt;event_id = 4688_2&lt;BR /&gt;event_type = 8_2&lt;/P&gt;&lt;P&gt;This solution works fine when the data is just a single line item. 
But when there are multiple records, then data from the first record is repeated and extra records are added for empty lines. Please help us with a generic solution that can handle any number of rows in the input file and extract data for multiple attributes using regex.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jul 2024 13:33:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390782#M247335</guid>
      <dc:creator>NagendraKumar</dc:creator>
      <dc:date>2024-07-22T13:33:54Z</dc:date>
    </item>
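    <!--
    Editorial sketch (not part of the original post). One plausible reading of the
    behaviour described in the question is that a pattern evaluated once against the
    whole content only ever captures the first record, while the same pattern applied
    to each non-blank line yields one result per record. The Python below illustrates
    that difference outside NiFi. It is an assumption-laden sketch: the SAMPLE text is
    an abbreviated version of the posted input, the names KEYS and extract are
    hypothetical, and whitespace (rather than a tab) is assumed to end each value.

```python
import re

# Abbreviated form of the two posted records, separated by a blank line (assumption).
SAMPLE = (
    "<13>Jul 18 11:39:11 host Computer=mycomputer1 "
    "OriginatingComputer=102.123.33.1 EventID=4688 EventType=8\n"
    "\n"
    "{122}Aug 18 11:39:11 host Computer=mycomputer2.gmail.com "
    "OriginatingComputer=125.123.33.1 EventID=4688_2 EventType=8_2\n"
)

# Output label mapped to the key name used in the log line.
KEYS = {"computer": "Computer", "event_id": "EventID", "event_type": "EventType"}


def extract(text):
    """Return the first value found in text for each wanted key."""
    found = {}
    for label, key in KEYS.items():
        # (?<!\S) requires the key to start a token, so "Computer=" does not
        # match inside "OriginatingComputer=".
        match = re.search(r"(?<!\S)" + re.escape(key) + r"=(\S*)", text)
        found[label] = match.group(1) if match else ""
    return found


# Matching once against the whole content only sees the first record.
whole = extract(SAMPLE)

# Matching each non-blank line separately gives one result per record.
per_record = [extract(line) for line in SAMPLE.splitlines() if line.strip()]
```

    Here whole holds only the values of the first record, while per_record holds one
    dictionary per log line and skips the blank separator line.
    -->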
    <item>
      <title>Re: Extracting data from unstructured logs text from multiple records</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390796#M247340</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/112177"&gt;@NagendraKumar&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;ExtractText is only going to work with a well-defined content structure. So when you have an unknown number of records in a single FlowFile, you would be better off splitting that multi-record file into single-record files against which you can apply your ExtractText and ReplaceText dataflow.&amp;nbsp; You can then easily merge those split records back into one file using MergeContent with the Defragment option.&lt;BR /&gt;&lt;BR /&gt;Since your files have an unknown number of records separated by a blank line, the SplitContent processor can easily be used to split the source FlowFile into individual record FlowFiles.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="MattWho_0-1721668773266.png" style="width: 715px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/41217iFACBBA7F12B2C83C/image-dimensions/715x502?v=v2" width="715" height="502" role="button" title="MattWho_0-1721668773266.png" alt="MattWho_0-1721668773266.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The "Byte Sequence" is simply two line returns.&lt;BR /&gt;After your ExtractText and ReplaceText processors, you can recombine all the splits into one FlowFile using MergeContent as below:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="MattWho_1-1721668982695.png" style="width: 713px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/41218iDD0F41DB97994DCD/image-dimensions/713x497?v=v2" width="713" height="497" role="button" title="MattWho_1-1721668982695.png" alt="MattWho_1-1721668982695.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Please help our community grow. 
If you found&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;any&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "&lt;SPAN&gt;&lt;EM&gt;&lt;STRONG&gt;&lt;FONT color="#FF0000"&gt;Accept as Solution&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/EM&gt;" on&amp;nbsp;&lt;STRONG&gt;one or more&lt;/STRONG&gt;&amp;nbsp;of them that helped.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you,&lt;BR /&gt;Matt&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jul 2024 17:23:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390796#M247340</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2024-07-22T17:23:40Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from unstructured logs text from multiple records</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390849#M247358</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/35454"&gt;@MattWho&lt;/a&gt;&amp;nbsp; - Thanks a lot for your valuable input!&lt;/P&gt;&lt;P&gt;This solution should work, but I am concerned about the volume of records. We expect to receive&amp;nbsp;150,000,000 records (one hundred fifty million) per day, so splitting that many records and merging them again might be a costly operation. Is there an alternative approach that we can explore?&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 09:44:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390849#M247358</guid>
      <dc:creator>NagendraKumar</dc:creator>
      <dc:date>2024-07-23T09:44:34Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from unstructured logs text from multiple records</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390854#M247362</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/112177"&gt;@NagendraKumar&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;You might want to try using the &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.27.0/org.apache.nifi.processors.standard.QueryRecord/index.html" target="_blank"&gt;QueryRecord&lt;/A&gt; processor or &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.27.0/org.apache.nifi.processors.script.ScriptedTransformRecord/index.html" target="_blank"&gt;ScriptedTransformRecord&lt;/A&gt; processor.&amp;nbsp; Since your data is unstructured, you could try using the &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.27.0/org.apache.nifi.grok.GrokReader/index.html" target="_blank"&gt;GrokReader&lt;/A&gt; and &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.27.0/org.apache.nifi.text.FreeFormTextRecordSetWriter/index.html" target="_blank"&gt;FreeFormTextRecordSetWriter&lt;/A&gt;.&lt;BR /&gt;&lt;BR /&gt;I agree that splitting and merging is not ideal with so many FlowFiles.&amp;nbsp; ExtractText loads FlowFile content into memory in order to parse it for extracting bits (high heap usage).&amp;nbsp; MergeContent loads FlowFile metadata (FlowFile attributes and metadata) into heap memory for all FlowFiles allocated to merge bins (high heap usage, which can be managed by using multiple MergeContent processors in series, each limiting the max bin FlowFile count).&lt;BR /&gt;&lt;BR /&gt;Hope this helps give you some alternate direction.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you,&lt;BR /&gt;Matt&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 12:44:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390854#M247362</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2024-07-23T12:44:59Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from unstructured logs text from multiple records</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390936#M247405</link>
      <description>&lt;P&gt;Thanks a lot&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/35454"&gt;@MattWho&lt;/a&gt;&amp;nbsp;for your valuable comments. Please help with configuring the GrokReader for the input data below, as I am unable to find the right documentation for configuring the GrokReader with unstructured data.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&amp;lt;13&amp;gt;Jul 18 11:39:11 test234104.test.gmail.ae AgentDevice=WindowsLog    AgentLogFile=Security   PluginVersion=100.3.1.22  Source=Microsoft-Windows-Security-Auditing      Computer=mycomputer1 OriginatingComputer=102.123.33.1    User= Domain=     EventID=4688 EventIDCode=4688  EventType=8 EventCategory=13312&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jul 2024 14:02:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/390936#M247405</guid>
      <dc:creator>NagendraKumar</dc:creator>
      <dc:date>2024-07-24T14:02:34Z</dc:date>
    </item>
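    <!--
    Editorial sketch (not part of the original post). Grok patterns are built from
    named regular expressions, so before writing a custom pattern file it can help to
    pin down which pieces of the sample line need names. The Python below does that
    with ordinary named groups: a syslog-style header (priority, timestamp, host)
    followed by whitespace-separated key=value pairs. It is an illustration of the
    structure only and is not a GrokReader configuration; HEADER, PAIR and parse_line
    are hypothetical names, and the header layout is assumed from the single sample
    line in the post.

```python
import re

# Header assumed from the sample: "<13>Jul 18 11:39:11 host rest-of-line".
HEADER = re.compile(
    r"^<(?P<priority>\d+)>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<body>.*)$"
)

# One key=value token; the value may be empty (as in "User=" in the sample).
PAIR = re.compile(r"(?<!\S)(?P<key>\w+)=(?P<value>\S*)")


def parse_line(line):
    """Parse one log line into a dict, or return None if the header does not match."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    record = {name: match.group(name) for name in ("priority", "timestamp", "host")}
    for pair in PAIR.finditer(match.group("body")):
        record[pair.group("key")] = pair.group("value")
    return record
```

    Each named group here corresponds to a field that a custom Grok pattern would also
    have to name. A line whose prefix differs from the assumed header, such as the
    {122} record in the original question, returns None and would need its own pattern.
    -->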
    <item>
      <title>Re: Extracting data from unstructured logs text from multiple records</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/391017#M247438</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/112177"&gt;@NagendraKumar&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;This is not something I have worked with much.&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;&amp;nbsp;The&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://nifi.apache.org/documentation/nifi-2.0.0-M4/components/org.apache.nifi/nifi-record-serialization-services-nar/2.0.0-M4/org.apache.nifi.grok.GrokReader/index.html" target="_blank" rel="nofollow noopener noreferrer"&gt;GrokReader&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;is what would commonly be used to parse unstructured data. Your data looks similar to the Cisco syslog structure.&amp;nbsp; While the GrokReader has a built-in pattern file, you may find yourself needing to define a custom pattern file for your specific data.&amp;nbsp; You might find this other community post helpful:&lt;/SPAN&gt;&lt;BR /&gt;&lt;A href="https://community.cloudera.com/t5/Support-Questions/ExtractGrok-processor-Writing-Regex-to-parse-Cisco-syslog/td-p/233095" target="_blank" rel="noopener"&gt;https://community.cloudera.com/t5/Support-Questions/ExtractGrok-processor-Writing-Regex-to-parse-Cis...&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Hopefully you can use the pattern file example provided through the GitHub link from that other community thread to help create a custom pattern file that works for your specific data:&lt;BR /&gt;&lt;A href="https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/resources/TestExtractGrok/patterns" target="_blank"&gt;https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/resources/TestExtractGrok/patterns&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I hope this information helps with your use case.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you,&lt;BR /&gt;Matt&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jul 2024 13:22:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Extracting-data-from-unstructured-logs-text-from-multiple/m-p/391017#M247438</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2024-07-26T13:22:57Z</dc:date>
    </item>
  </channel>
</rss>

