Limit number of files fetched by directory

New Contributor

We have a source directory for GetFile that has one thousand subdirectories (i.e. there are one thousand users who each have a Windows share).

Processing issues arise when a user drops several thousand files into their directory. I presume GetFile scans each directory sequentially and, when it finds files, empties the directory (we delete the source files). So when it comes across a directory that has several thousand files, or one that is being constantly written to, that user effectively shuts everyone else out for several minutes or even tens of minutes.

What I would like to do when coming across a directory with many files is to pick up, say, one hundred and then move on to the next directory. This would allow for a more even distribution among users.
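To make the request concrete, here is a minimal Python sketch of the behaviour described above: take at most a fixed number of files from each user directory per pass, then move on. It is a standalone illustration, not a NiFi feature or NiFi code; the function name `fair_batch` and the parameter `per_dir_limit` are hypothetical.

```python
import os
from typing import Dict, List


def fair_batch(root: str, per_dir_limit: int = 100) -> Dict[str, List[str]]:
    """Return up to per_dir_limit file paths from each immediate subdirectory of root."""
    batch: Dict[str, List[str]] = {}
    with os.scandir(root) as users:
        # Sort by name so every pass visits the user directories in the same order.
        for user_dir in sorted(users, key=lambda e: e.name):
            if not user_dir.is_dir():
                continue
            picked: List[str] = []
            with os.scandir(user_dir.path) as entries:
                for entry in entries:
                    if entry.is_file():
                        picked.append(entry.path)
                        if len(picked) >= per_dir_limit:
                            break  # cap reached: move on to the next user
            if picked:
                batch[user_dir.name] = picked
    return batch
```

Calling `fair_batch(source_dir, per_dir_limit=100)` repeatedly would drain a busy directory one hundred files at a time while every other user still gets a turn on each pass.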

There is a <path> attribute that distinguishes between users, but I'm not sure how to take advantage of it to solve my problem. Thanks in advance for any tips.

4 REPLIES

Master Mentor

@MikeH 

Sounds like you are regularly ingesting a considerable number of files from your local filesystem. Is this a multi-node NiFi cluster or a single standalone instance of NiFi handling this use case?

Both the GetFile and ListFile processors have a "Path Filter" property that takes a Java regular expression. You could add multiple processors, each with a different regex, so that each one pulls from a subset of the user subdirectories.

You might consider using the ListFile processor along with FetchFile instead of GetFile. ListFile produces zero-byte FlowFiles (one FlowFile for each file listed). It is then connected to a FetchFile processor, which uses the attributes set on each FlowFile to fetch the content of the source file and add it to the FlowFile. With a NiFi cluster, this design lets you redistribute the zero-byte FlowFiles across all nodes, so the heavy work of reading in the content and processing each FlowFile is spread across multiple servers (NiFi cluster nodes). With this approach you can also have many ListFile processors all feeding a single FetchFile.

So perhaps you have a regex for all directories starting with A through C in one processor, another processor for D through F, and so on.
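As a rough illustration of that split, the Python sketch below checks which processor would own a given user directory. The processor names and patterns are hypothetical, and it assumes the filter has to match the whole subdirectory name. NiFi evaluates the property as a Java regular expression; Python's `re` is used here only because these simple character-class patterns are written the same way in both.

```python
import re

# Hypothetical patterns, one per ListFile (or GetFile) processor.
PATH_FILTERS = {
    "processor_1": r"[A-Ca-c].*",  # directories starting with A through C
    "processor_2": r"[D-Fd-f].*",  # directories starting with D through F
}


def owners(directory_name):
    """Return the names of the processors whose pattern matches the whole directory name."""
    return [
        name
        for name, pattern in PATH_FILTERS.items()
        if re.fullmatch(pattern, directory_name)
    ]
```

A directory should come back with exactly one owner. An empty list means no processor would pick it up, so the set of patterns needs to cover every first character that actually occurs in the user directory names.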

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

New Contributor

Thanks Matt, I will look into these ideas. Unfortunately it is all on one server with one NiFi instance. Since these are Windows shares, I am looking at restricting the SMB transfer rate, but again I only want to slow down the thousand-file guy, so we'll see.

Super Guru

Hi @MikeH ,

Have you tried adjusting the file age properties? My guess is that when a user drops thousands of files into their folder, it takes time to copy all of them (depending on how big the files are, of course). Let's say that on average it takes minutes to copy those files. In that case you can set the Minimum File Age to 2 minutes, so the processor only pulls files that have been sitting there for at least 2 minutes; anything still being copied, where the modified date is less than 2 minutes old, won't get picked up. I know it's not perfect, but it will allow for some distribution without getting stuck on a folder with many files. The more you increase the minimum age, the fewer files you will pick up, so you can adjust accordingly.
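For anyone unsure what a minimum-age check amounts to, here is a small Python sketch of the idea. It is not NiFi's implementation; the function name `old_enough` and its parameters are made up for illustration.

```python
import os
import time


def old_enough(path, min_age_seconds, now=None):
    """Return True if the file was last modified at least min_age_seconds ago."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) >= min_age_seconds
```

With `min_age_seconds=120`, a file that is still being written keeps a fresh modification time and is skipped until it has been left alone for two minutes.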

If that helps, please make sure to accept the solution.

Thanks

New Contributor

That's a good idea; however, low latency is a user requirement. Currently, processing each file from source to destination takes around one minute. If I add a two-minute delay, the users would not be happy.