<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Limit number of files fetched by directory in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389364#M246975</link>
    <description>&lt;P&gt;That's a good idea; however, low latency is a user requirement.&amp;nbsp; Currently, processing each file from source to destination takes around one minute.&amp;nbsp; If I add a two-minute delay, the users would not be happy.&lt;/P&gt;</description>
    <pubDate>Tue, 18 Jun 2024 20:47:51 GMT</pubDate>
    <dc:creator>MikeH</dc:creator>
    <dc:date>2024-06-18T20:47:51Z</dc:date>
    <item>
      <title>Limit number of files fetched by directory</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389356#M246970</link>
      <description>&lt;P class="lia-align-left"&gt;We have a source directory for GetFile which has one thousand subdirectories (i.e., there are one thousand users who each have a Windows share).&lt;/P&gt;&lt;P&gt;Processing issues arise when a user drops several thousand files into their directory.&amp;nbsp; I presume GetFile scans each directory sequentially and, when it finds files, empties the directory (we delete the source file).&amp;nbsp; So when it comes across a directory that has several thousand files, or a directory that is being constantly written to, that user effectively shuts out everyone else for several minutes or tens of minutes.&lt;/P&gt;&lt;P&gt;What I would like to do when coming across a directory with many files is to pick up, say, one hundred and then move on to the next directory.&amp;nbsp; This would allow for a more even distribution among users.&lt;/P&gt;&lt;P&gt;There is a &amp;lt;path&amp;gt; attribute that distinguishes between users, but I'm not sure how to take advantage of that to solve my problem.&amp;nbsp; Thanks in advance for any tips.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2024 19:07:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389356#M246970</guid>
      <dc:creator>MikeH</dc:creator>
      <dc:date>2024-06-18T19:07:13Z</dc:date>
    </item>
    <item>
      <title>Re: Limit number of files fetched by directory</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389360#M246971</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/111177"&gt;@MikeH&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Sounds like you are regularly ingesting a considerable number of files from your local filesystem.&amp;nbsp; Is this a NiFi multi-node cluster or a single standalone instance of NiFi handling this use case?&lt;BR /&gt;&lt;BR /&gt;Both the GetFile and ListFile processors have a "Path Filter" property that takes a Java regular expression.&amp;nbsp; You could add multiple processors, each with a different regex, so they each pull from a subset of the user sub-directories.&lt;BR /&gt;&lt;BR /&gt;You might consider using the ListFile processor along with the FetchFile processor instead of the GetFile processor.&amp;nbsp; The ListFile processor produces zero-byte FlowFiles (one FlowFile for each file listed). It is then connected to a FetchFile processor, which uses the attributes set for that source file to fetch the content and add it to the FlowFile.&amp;nbsp; With a NiFi cluster, this design approach allows you to redistribute the zero-byte FlowFiles across all nodes in the cluster so the heavy work of reading in the content and processing each FlowFile is spread across multiple servers (NiFi cluster nodes).&amp;nbsp; With this approach you can also have many ListFile processors all feeding a single FetchFile processor.&lt;BR /&gt;&lt;BR /&gt;So perhaps you have a regex for all directories starting with A through C in one processor and another processor for D through F, etc.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you,&lt;BR /&gt;Matt&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2024 20:17:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389360#M246971</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2024-06-18T20:17:27Z</dc:date>
    </item>
    <item>
      <title>Re: Limit number of files fetched by directory</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389362#M246973</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/111177"&gt;@MikeH&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Have you tried adjusting the file age properties? My guess is that when a user drops thousands of files into their own folder, it takes time to copy all of them (depending on how big the files are, of course). Let's say that on average it takes minutes to copy those files. In that case you can set the Minimum File Age to 2 minutes, so the processor only pulls files that have been sitting there for at least 2 minutes; anything still being copied, where the modified date is less than 2 minutes old, won't get picked up. I know it's not perfect, but it will allow for some distribution without getting stuck on a folder with many files. The more you increase the minimum age, the fewer files you will pick up, so you can adjust accordingly.&lt;/P&gt;&lt;P&gt;If that helps, please make sure to accept the solution.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2024 20:27:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389362#M246973</guid>
      <dc:creator>SAMSAL</dc:creator>
      <dc:date>2024-06-18T20:27:57Z</dc:date>
    </item>
    <item>
      <title>Re: Limit number of files fetched by directory</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389363#M246974</link>
      <description>&lt;P&gt;Thanks Matt, I will look into these ideas.&amp;nbsp; Unfortunately, it is all on one server with one NiFi instance.&amp;nbsp; Since these are Windows shares, I am looking at restricting the SMB transfer rate, but again I only want to slow down the thousand-file guy, so we'll see.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2024 20:45:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389363#M246974</guid>
      <dc:creator>MikeH</dc:creator>
      <dc:date>2024-06-18T20:45:37Z</dc:date>
    </item>
    <item>
      <title>Re: Limit number of files fetched by directory</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389364#M246975</link>
      <description>&lt;P&gt;That's a good idea; however, low latency is a user requirement.&amp;nbsp; Currently, processing each file from source to destination takes around one minute.&amp;nbsp; If I add a two-minute delay, the users would not be happy.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2024 20:47:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Limit-number-of-files-fetched-by-directory/m-p/389364#M246975</guid>
      <dc:creator>MikeH</dc:creator>
      <dc:date>2024-06-18T20:47:51Z</dc:date>
    </item>
  </channel>
</rss>

