Created 06-18-2024 12:07 PM
We have a source directory for GetFile which has one thousand subdirectories (i.e. there are one thousand users, each with a Windows share).
Processing issues arise when a user drops several thousand files into their directory. I presume GetFile scans each directory sequentially and, when it finds files, empties the directory (we delete the source file). So when it comes across a directory that has several thousand files, or one that is being constantly written to, that user effectively shuts out everyone else for several minutes or even tens of minutes.
What I would like to do when coming across a directory with many files is to pick up, say, one hundred files and then move on to the next directory. This would allow a more even distribution among users.
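To make the desired behavior concrete, here is a minimal Python sketch of the round-robin pickup described above (a hypothetical helper for illustration only, not NiFi code): take at most a fixed batch of files from each user's subdirectory per pass, so one busy directory cannot monopolize a scan.

```python
import os

def round_robin_pickup(root, batch_size=100):
    """Collect at most `batch_size` files from each user subdirectory
    per pass, so a directory with thousands of files cannot starve
    the other users' directories."""
    picked = []
    for sub in sorted(os.listdir(root)):
        subdir = os.path.join(root, sub)
        if not os.path.isdir(subdir):
            continue
        count = 0
        for name in sorted(os.listdir(subdir)):
            path = os.path.join(subdir, name)
            if os.path.isfile(path):
                picked.append(path)
                count += 1
                if count >= batch_size:
                    break  # move on to the next user's directory
    return picked
```

Each pass picks up at most `batch_size` files per user; a directory with thousands of pending files is simply drained over several passes instead of blocking everyone.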
There is a <path> attribute that distinguishes between users, but I'm not sure how to take advantage of it to solve my problem. Thanks in advance for any tips.
Created 06-18-2024 01:17 PM
@MikeH
Sounds like you are regularly ingesting a considerable number of files from your local filesystem. Is this a NiFi multi-node cluster or a single standalone instance of NiFi handling this use case?
Both the GetFile and ListFile processors have a "Path Filter" property that takes a Java regular expression. You could add multiple processors, each with a different regex, so they each pull from a subset of the user subdirectories.
You might consider using the ListFile and FetchFile processors instead of the GetFile processor. The ListFile processor produces zero-byte FlowFiles (one FlowFile for each file listed); it is then connected to a FetchFile processor, which uses the attributes set on each FlowFile to fetch the source file's content and add it to the FlowFile. With a NiFi cluster, this design approach allows you to redistribute the zero-byte FlowFiles across all nodes, so the heavy work of reading in the content and processing each FlowFile is spread across multiple servers (NiFi cluster nodes). With this approach you can also have many ListFile processors all feeding a single FetchFile.
So perhaps you have a regex for all directories starting with A through C in one processor and another processor for D through F, etc...
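To illustrate the split suggested above, here is a small Python sketch (the patterns are hypothetical examples; in NiFi each regex would go in a separate processor's "Path Filter" property, which uses Java regex syntax, identical for patterns this simple):

```python
import re

# Hypothetical partition of user directories across three
# GetFile/ListFile processors, keyed by the first letter of
# the directory name.
path_filters = {
    "processor_1": r"^[a-cA-C].*",  # users whose share starts with A-C
    "processor_2": r"^[d-fD-F].*",  # users D-F
    "processor_3": r"^[g-zG-Z].*",  # everyone else
}

def matching_processor(dirname):
    """Return which processor's Path Filter would claim this directory."""
    for proc, pattern in path_filters.items():
        if re.fullmatch(pattern, dirname):
            return proc
    return None
```

With the thousand shares split this way, a flood of files in one user's directory only delays the other users behind the same processor, not the whole set.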
Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 06-18-2024 01:45 PM
Thanks Matt, I will look into these ideas. Unfortunately it is all on one server with a single NiFi instance. Since these are Windows shares, I am looking at restricting the SMB transfer rate, but again, I only want to slow down the thousand-file guy, so we'll see.
Created 06-18-2024 01:27 PM
Hi @MikeH ,
Have you tried adjusting the file age properties? My guess is that when a user drops thousands of files into their own folder, it takes time to copy all of them (depending on how big the files are, of course). Let's say it takes a couple of minutes on average to copy those files; in that case you can set the Minimum File Age to 2 minutes. This will only pull files that have been sitting there for at least 2 minutes, so anything recently copied, where the modified date is less than 2 minutes old, won't get picked up. I know it's not perfect, but it allows for some distribution without getting stuck on a folder with many files. The more you increase the minimum age, the fewer files you will pick up per scan, so you can adjust accordingly.
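The "Minimum File Age" check described above amounts to comparing a file's last-modified time against the clock. A minimal Python sketch of that logic (a hypothetical helper mimicking the property's behavior, not NiFi's actual implementation):

```python
import os
import time

def old_enough(path, min_age_seconds=120):
    """Mimic the "Minimum File Age" property: skip files whose
    last-modified time is within the last `min_age_seconds`, on the
    assumption they may still be mid-copy."""
    return (time.time() - os.path.getmtime(path)) >= min_age_seconds
```

Files still being written get a fresh modification time on each write, so they keep failing this check until the copy has been quiet for the configured age.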
If that helps, please make sure to accept the solution.
Thanks
Created 06-18-2024 01:47 PM
That's a good idea; however, low latency is a user requirement. Currently, processing each file from source to destination takes around one minute. If I add a two-minute delay, the users will not be happy.