Support Questions
Find answers, ask questions, and share your expertise

Handling 3k files by listfile processor

Hi, I have a requirement to handle 3k files with the ListFile processor in NiFi. I would use the ListFile processor, which retrieves a complete listing of all files in the target directory and creates a single 0-byte FlowFile for each of them. The data is then fetched by the FetchFile processor, which retrieves the content of each listed file and inserts that content into the FlowFile. The issue is that if, say, 3k files are retrieved by ListFile, then since it creates a FlowFile for each one, it causes flooding of sessions. Currently we are using single processors for this. Is there any possible way to avoid the flooding?
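For reference, a minimal sketch of the two-processor flow described above. Property names are the standard NiFi ListFile/FetchFile properties; the directory path is a placeholder, and the FetchFile values shown are the processor defaults:

```
ListFile
  Input Directory : /data/incoming                 # placeholder path
  Recurse Subdirectories : true

    --(success)--> FetchFile

FetchFile
  File to Fetch       : ${absolute.path}/${filename}   # default: uses attributes set by ListFile
  Completion Strategy : None
```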


Cloudera Employee


Have you tried increasing the yield duration?

Or increasing the run schedule?

Or, alternatively, throttling NiFi on the connection by lowering the back pressure threshold?


@richard dobson I will try it and let you know. In the meantime, could you please tell me how changing these settings would avoid the flooding? Because each time ListFile runs, it would be listing around 3k files. Please let me know if I am missing anything.

Cloudera Employee

By increasing the yield and run schedule you increase the time between NiFi executions, and by lowering the back pressure threshold you throttle the queue. This should reduce the number of concurrent sessions.
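As a sketch, the suggested throttling might look like the following. The values are illustrative only, not recommendations; the defaults noted in comments are NiFi's standard defaults:

```
ListFile (Scheduling tab)
  Run Schedule : 1 min        # default 0 sec; list less frequently

ListFile (Settings tab)
  Yield Duration : 10 sec     # default 1 sec; back off longer after a yield

Connection: ListFile -> FetchFile
  Back Pressure Object Threshold : 500   # default 10000; once 500 FlowFiles
                                         # are queued, ListFile stops being
                                         # scheduled until the queue drains
```

Lowering the object threshold on the connection is what actually prevents thousands of FlowFiles from piling up at once: back pressure stops the upstream processor from being scheduled while the queue is above the threshold.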

@richard dobson Hi again. I have tried the above config and it's working, but since the ListFile processor lists the files in one go, the issue has moved to the other side. On further analysis, I found that the issue is with the FetchFile processor. Would increasing the yield and run schedule on the FetchFile processor avoid the flooding, since it is FetchFile that creates parallel sessions for the 3k files?

@Richard Dobson I have tried your recommendations and it's working fine after I increased the run schedule and concurrent tasks in FetchFile. Among the 3k files being generated, there are files of large size (around 1 GB). So if we increase the run schedule, it would take more time to complete. Is there any way to tackle that?

Cloudera Employee

If you only throttle the fetch, then it will only slow down the time between process starts rather than the time it takes to complete a single fetch.
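In other words, large files are not fetched any slower; you only space out when fetches begin. An illustrative FetchFile configuration that limits parallelism without adding delay between runs (values are examples, not recommendations):

```
FetchFile (Scheduling tab)
  Concurrent Tasks : 4      # caps the number of parallel fetch sessions
  Run Schedule     : 0 sec  # keep fetches back-to-back, so 1 GB files
                            # are not delayed further by the scheduler
```

With this shape, at most four files are being read at any moment, while the queue (bounded by the connection's back pressure threshold) feeds FetchFile continuously.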