<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Nifi partition file by date in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162341#M45155</link>
    <description>&lt;P&gt;Hello, thanks everyone for the prompt response.&lt;/P&gt;&lt;P&gt;With some help I was able to figure it out.
&lt;/P&gt;&lt;P&gt;My main problem was understanding the difference between the Grouping Regular Expression and the expression that extracts the date parameter, which in my case are pretty much the same expression.&lt;/P&gt;&lt;P&gt;I also have to admit that the RouteText.Group attribute was not easy to find, even in the documentation.
&lt;/P&gt;&lt;P&gt;I feel that reading logs from a TCP connection and storing them partitioned directly in a Hive table should be a fairly common use case, so I'm attaching the template as my grain-of-sand contribution.&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/9094-recordtexttopartition.xml"&gt;recordtexttopartition.xml&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks again&lt;/P&gt;</description>
    <pubDate>Thu, 03 Nov 2016 22:22:35 GMT</pubDate>
    <dc:creator>sciciliani</dc:creator>
    <dc:date>2016-11-03T22:22:35Z</dc:date>
    <item>
      <title>Nifi partition file by date</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162337#M45151</link>
      <description>&lt;P&gt;I'm trying to split a FlowFile into multiple files by date.&lt;/P&gt;&lt;P&gt;Imagine that you are receiving logs and want to save them as a Hive partitioned table, so that, for example, all records with date 2016-01-01 go into the directory dt=2016-01-01.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2016 06:05:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162337#M45151</guid>
      <dc:creator>sciciliani</dc:creator>
      <dc:date>2016-11-03T06:05:49Z</dc:date>
    </item>
    <item>
      <title>Re: Nifi partition file by date</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162338#M45152</link>
      <description>&lt;P&gt;If the date is in an attribute named 'dt', you can reference that attribute in the directory path using the syntax ${dt}.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2016 09:02:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162338#M45152</guid>
      <dc:creator>bhopp</dc:creator>
      <dc:date>2016-11-03T09:02:09Z</dc:date>
    </item>
    <item>
      <title>Re: Nifi partition file by date</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162339#M45153</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/14119/sciciliani.html" nodeid="14119" target="_blank"&gt;@Santiago Ciciliani&lt;/A&gt;&lt;P&gt;Do you have any idea how many log lines there are per FlowFile?&lt;/P&gt;&lt;P&gt;A suggested dataflow may look like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="9069-screen-shot-2016-11-03-at-84646-am.png" style="width: 414px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/20588iC8EDACFB697DE054/image-size/medium?v=v2&amp;amp;px=400" role="button" title="9069-screen-shot-2016-11-03-at-84646-am.png" alt="9069-screen-shot-2016-11-03-at-84646-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The SplitText processor is used to break up your incoming log files into many smaller FlowFiles that the RouteText processor can handle more easily without running out of heap memory. This is done by setting the Line Split Count property. How many log lines you can have per split FlowFile depends on how much heap you have configured for NiFi and on the size of each log line.&lt;/P&gt;&lt;P&gt;The RouteText processor evaluates the entire FlowFile's content and routes groups of log lines to a "dt" relationship:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="9070-screen-shot-2016-11-03-at-85139-am.png" style="width: 579px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/20589i8E7845C4C8B2AD13/image-size/medium?v=v2&amp;amp;px=400" role="button" title="9070-screen-shot-2016-11-03-at-85139-am.png" alt="9070-screen-shot-2016-11-03-at-85139-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The UpdateAttribute processor (Optional) will create a "dt" attribute from the "RouteText.Group" attribute.
You can use this attribute later to define the Hive table partition:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="9091-screen-shot-2016-11-03-at-85318-am.png" style="width: 446px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/20590i155D982B9408307A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="9091-screen-shot-2016-11-03-at-85318-am.png" alt="9091-screen-shot-2016-11-03-at-85318-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The MergeContent processor (Optional) is used to combine FlowFiles with matching values (dates) in the "RouteText.Group" attribute back into a single FlowFile.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="9092-screen-shot-2016-11-03-at-85730-am.png" style="width: 478px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/20591iDB272B7F5D6F8289/image-size/medium?v=v2&amp;amp;px=400" role="button" title="9092-screen-shot-2016-11-03-at-85730-am.png" alt="9092-screen-shot-2016-11-03-at-85730-am.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Don't forget to set the number of entries and max bin age properties to get the most out of this processor.&lt;/P&gt;&lt;P&gt;Route the "Merged" relationship from this processor to your Hive-based processor.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Matt&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 11:43:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162339#M45153</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2019-08-18T11:43:36Z</dc:date>
    </item>
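    <!--
    A minimal Python sketch of the grouping idea described in the reply above, assuming each log line begins with a YYYY-MM-DD date. It is not NiFi's implementation: the regular expression, the sample lines, and the helper names group_lines_by_date and partition_directory are illustrative assumptions.

```python
import re
from collections import defaultdict

# Assumption: every log line starts with an ISO date (YYYY-MM-DD).
# The first capture group plays the role of the grouping value.
GROUPING_REGEX = re.compile(r"^(\d{4}-\d{2}-\d{2}).*$")


def group_lines_by_date(text):
    """Group lines by the date captured at the start of each line."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = GROUPING_REGEX.match(line)
        if match is None:
            continue  # lines without a leading date are skipped in this sketch
        groups[match.group(1)].append(line)
    return dict(groups)


def partition_directory(base_dir, group_value):
    """Build a Hive-style partition directory such as /logs/dt=2016-01-01."""
    return f"{base_dir}/dt={group_value}"


sample = (
    "2016-01-01 10:00:00 INFO start\n"
    "2016-01-02 09:30:00 WARN disk\n"
    "2016-01-01 11:15:00 INFO stop"
)
groups = group_lines_by_date(sample)
directories = {date: partition_directory("/logs", date) for date in groups}
```

    Each key of groups stands in for the RouteText.Group value mentioned in the reply, and the directory name follows the dt=2016-01-01 layout from the question.
    -->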
    <item>
      <title>Re: Nifi partition file by date</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162340#M45154</link>
      <description>&lt;P&gt;From the NiFi User Group Mailing List by &lt;A rel="user" href="https://community.cloudera.com/users/364/jwitt.html" nodeid="364"&gt;@jwitt&lt;/A&gt;:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Split with Grouping:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt; Take a look at &lt;STRONG&gt;RouteText&lt;/STRONG&gt;. This allows you to efficiently split up&lt;/P&gt;&lt;P&gt;line-oriented data into groups based on matching values rather than&lt;/P&gt;&lt;P&gt;SplitText, which will be a line for line split.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Merge Grouped Data:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt; MergeContent&lt;/STRONG&gt; processor will do the trick and you can use the correlation&lt;/P&gt;&lt;P&gt;feature to align only those which are from the same group/pattern.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Write to destination:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt; You can write directly to HDFS using &lt;STRONG&gt;PutHDFS&lt;/STRONG&gt; or you can prepare the&lt;/P&gt;&lt;P&gt;data and write to Hive.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2016 20:17:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162340#M45154</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-11-03T20:17:12Z</dc:date>
    </item>
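    <!--
    A minimal Python sketch of merging by a correlation attribute, as the reply above suggests doing with MergeContent. It only illustrates the binning idea under stated assumptions and is not how NiFi implements it: FlowFiles are modeled as (attributes, content) tuples, merge_by_correlation is a hypothetical helper, and the final flush stands in for a bin reaching its maximum age.

```python
from collections import defaultdict


def merge_by_correlation(flowfiles, correlation_attribute, max_entries):
    """Bin contents by one attribute value and emit a bin once it is full."""
    bins = defaultdict(list)
    merged = []
    for attributes, content in flowfiles:
        key = attributes[correlation_attribute]
        bins[key].append(content)
        if len(bins[key]) >= max_entries:
            # The bin is full: join its contents and start a fresh bin.
            merged.append((key, "\n".join(bins.pop(key))))
    # Flush partially filled bins (a stand-in for bin age expiry).
    for key, contents in bins.items():
        merged.append((key, "\n".join(contents)))
    return merged


flowfiles = [
    ({"RouteText.Group": "2016-01-01"}, "a"),
    ({"RouteText.Group": "2016-01-02"}, "b"),
    ({"RouteText.Group": "2016-01-01"}, "c"),
    ({"RouteText.Group": "2016-01-01"}, "d"),
]
result = merge_by_correlation(flowfiles, "RouteText.Group", max_entries=2)
```

    Only contents that share the same RouteText.Group value end up in the same merged output, which is what keeps each date in its own partition.
    -->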
    <item>
      <title>Re: Nifi partition file by date</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162341#M45155</link>
      <description>&lt;P&gt;Hello, thanks everyone for the prompt response.&lt;/P&gt;&lt;P&gt;With some help I was able to figure it out.
&lt;/P&gt;&lt;P&gt;My main problem was understanding the difference between the Grouping Regular Expression and the expression that extracts the date parameter, which in my case are pretty much the same expression.&lt;/P&gt;&lt;P&gt;I also have to admit that the RouteText.Group attribute was not easy to find, even in the documentation.
&lt;/P&gt;&lt;P&gt;I feel that reading logs from a TCP connection and storing them partitioned directly in a Hive table should be a fairly common use case, so I'm attaching the template as my grain-of-sand contribution.&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/9094-recordtexttopartition.xml"&gt;recordtexttopartition.xml&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks again&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2016 22:22:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Nifi-partition-file-by-date/m-p/162341#M45155</guid>
      <dc:creator>sciciliani</dc:creator>
      <dc:date>2016-11-03T22:22:35Z</dc:date>
    </item>
  </channel>
</rss>

