This processor is slow when working with small files and lots of nesting, once the directory holds around 30,000 files. How can I improve its performance? In the processor configuration I have 4 concurrent tasks, and the maximum file count per batch is set to 5000. How can I speed this up? Files are constantly pouring into the directories, GetFile cannot keep up with clearing them out, and as the number of files in the subdirectories grows it slows down exponentially.
No, I don't have a cluster set up, because GetFile works even more slowly in a cluster. I set the Java max heap to 4096m; the server has 1 CPU with 4 cores and 8 GB of RAM, but in Task Manager Java "eats" at most 2 GB, and CPU usage stays at 10-20%.
I am having the same problem. I am running a single NiFi instance (v 1.0.0) with 12 CPUs and 16 GB of memory. I have a GetFile processor pointed at a flat directory that can contain anywhere between 30,000 and 500,000 small files (~8 KB each). GetFile takes around 2-15 minutes (for the file counts mentioned above) to start publishing files after it is started. I also looked at using ListFile and saw nearly identical wait times.
Are there any best practices in the processor configuration that might alleviate some of this lag?
A couple of suggestions:
1. Number of Concurrent Tasks
You've mentioned that only 10-20% of the CPU is being used - what is the number of concurrent tasks you've configured for the GetFile processor? (You can find this in the processor configuration under 'Scheduling'.)
By default each processor uses only a single thread; increasing the 'Concurrent Tasks' setting on the relevant processor should help speed things up.
2. Using ListFile & FetchFile
While not specifically a performance fix, using ListFile & FetchFile is a more robust way to deal with files in NiFi and is generally recommended over GetFile (it gives you more control over the completion strategy, keeps state so the same file isn't collected twice, etc.).
If going with ListFile to FetchFile, you should increase the 'Concurrent Tasks' setting on the FetchFile processor: ListFile uses a single thread to produce the listing, while increasing concurrent tasks on FetchFile lets you pull multiple files in parallel to speed things up.
I was able to take a closer look at this one and it appears that reading from directories with a large number of files is going to be a problem with both the GetFile and the ListFile processors in their current form. The root of the problem is that the processors are using the java.io.File.listFiles() method to bring back the directory listing. This is known to be a hog with directories containing a large number of files. The filters and batch size properties are applied after the full listing has been pulled back, meaning that you'll have to bring back a list of all files in a directory even if you only want a small subset of them.
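To illustrate the pattern described above (a minimal sketch, not the actual processor code; `firstBatch` and its parameters are hypothetical names): `java.io.File.listFiles()` materializes the complete directory contents in one array before any filter or batch size can be applied.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class EagerListing {
    // Sketch of the eager approach: the whole directory is listed up
    // front, and the filter and batch size are only applied afterwards.
    static List<File> firstBatch(File dir, String suffix, int batchSize) {
        File[] all = dir.listFiles(); // O(n) time and memory before anything else happens
        List<File> batch = new ArrayList<>();
        if (all == null) return batch;
        for (File f : all) {
            if (f.getName().endsWith(suffix)) {
                batch.add(f);
                if (batch.size() >= batchSize) break;
            }
        }
        return batch; // the full array was built even though only batchSize files are wanted
    }

    public static void main(String[] args) throws Exception {
        File dir = java.nio.file.Files.createTempDirectory("eager-demo").toFile();
        for (int i = 0; i < 100; i++) {
            new File(dir, "f" + i + ".txt").createNewFile();
        }
        System.out.println(firstBatch(dir, ".txt", 5).size()); // prints 5
    }
}
```

With 500,000 entries, that up-front array is where the minutes of start-up lag go.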
A potential solution (for a later version) would be to use the java.nio packages to read the files as a directory stream, allowing you to apply the filter to the stream itself and stop at a configurable batch size. I would also argue that ListFile needs a configurable batch size for this very reason. I will submit an issue for this one.
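A rough sketch of that approach (an illustration under my own naming, not a patch to the processors): `Files.newDirectoryStream` iterates the directory lazily, applies a glob filter as entries are read, and lets the caller stop as soon as the batch is full, so the full listing is never held in memory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class StreamListing {
    // Collect at most batchSize files matching the glob, without
    // materializing the whole directory listing first.
    static List<Path> listBatch(Path dir, String glob, int batchSize) throws IOException {
        List<Path> batch = new ArrayList<>();
        // newDirectoryStream applies the glob filter while streaming entries
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                batch.add(p);
                if (batch.size() >= batchSize) break; // stop early; remaining entries are never read
            }
        }
        return batch;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("stream-demo");
        for (int i = 0; i < 1000; i++) {
            Files.createFile(dir.resolve("file-" + i + ".txt"));
        }
        System.out.println(listBatch(dir, "*.txt", 10).size()); // prints 10
    }
}
```

Because the stream stops after `batchSize` matches, the cost per scheduled run is bounded by the batch size rather than by the total number of files in the directory.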