@ShellyIsGolden
Listing 500k+ files is a lot of work, and so is the lookup on subsequent runs that checks for new files.
A few questions first:
- How is the scheduling configured on your ListSFTP processor?
- With the initial listing, how long does it take to output the 500K+ FlowFiles from the time the processor is started?
- When files are added to the SFTP server, are they added using a dot rename method?
- Is the last modified timestamp being updated on the files as they are being written to the SFTP server?
When the processor executes for the first time, it lists all files regardless of the configured "Entity Tracking Time Window" value. Subsequent executions only list files with a last modified timestamp within the configured "Entity Tracking Time Window". So accurate last modified timestamps are important.
With the initial listing of a new processor (or a copy of an existing processor), there is no step that checks the listed files against the cache entries to see whether a file has never been listed before or whether a listed file has changed in size since it was last listed. This lookup and comparison does happen on subsequent runs and can use considerable heap.
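To make the difference between the initial and subsequent runs concrete, here is a rough Python sketch of the behavior described above. It is only a simplified model for illustration, not NiFi's actual implementation: the names `RemoteFile`, `list_new_or_changed`, and the plain dict used as the cache are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteFile:
    # Hypothetical stand-in for one entry returned by the SFTP listing.
    path: str
    last_modified: float  # epoch seconds
    size: int


def list_new_or_changed(files, cache, now, window_seconds, initial_listing):
    """Return the files that would be emitted; updates `cache` in place.

    `cache` maps path -> size at the time the file was last listed.
    """
    emitted = []
    for f in files:
        if initial_listing:
            # First run: every file is emitted, no cache comparison happens.
            cache[f.path] = f.size
            emitted.append(f)
            continue
        # Subsequent runs: ignore files whose last modified timestamp
        # falls outside the tracking time window.
        if now - f.last_modified > window_seconds:
            continue
        # Per-file lookup against the cache: never listed, or size changed.
        if f.path not in cache or cache[f.path] != f.size:
            cache[f.path] = f.size
            emitted.append(f)
    return emitted
```

Note how a file that arrives with an old last modified timestamp is skipped on subsequent runs even though it was never listed, and how a smaller window means fewer files reach the per-file cache lookup.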
- Do you see any OutOfMemory (OOM) exceptions in your NiFi app logs?
- Depending on how often the processor executes, consider reducing the configured "Entity Tracking Time Window" value so that fewer files are listed in subsequent executions and need to be looked up. Set it to what is needed, plus a small buffer, to cover the time between processor executions. Since it sounds like you have your processor scheduled to execute every 1 minute, maybe try setting this to 30 minutes to see what impact it has.
- When you see the issue, does the processor show an active thread in the upper right corner that never seems to go away?
- When the issue appears, rather than copying the processor, what happens if you simply stop the processor (make sure all active threads complete, so no active thread count is shown in the upper right corner of the processor) and then just restart it?
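For the OOM question in the list above, a quick way to check is to scan the app log for `OutOfMemoryError`. The small Python helper below is just a sketch: the function name `find_oom_lines` is made up, and the log path in the comment assumes a default install where the app log is `logs/nifi-app.log`, so adjust it to your environment.

```python
def find_oom_lines(log_lines):
    """Return the log lines that mention a Java OutOfMemoryError."""
    return [line.rstrip("\n") for line in log_lines if "OutOfMemoryError" in line]


# Example usage (path is an assumption; point it at your own log file):
#
#   with open("logs/nifi-app.log", encoding="utf-8", errors="replace") as f:
#       for hit in find_oom_lines(f):
#           print(hit)
```

If this returns any lines, note their timestamps and compare them with the times the listing stopped producing FlowFiles.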
In the latest version of Apache NiFi, a "Remote Poll Batch Size" property (defaults to 5000) was added to the ListSFTP processor, which may help here considering the tremendous number of files being listed in your case.
Please help our community thrive. If you found that any of the suggestions/solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt