Process only one file at a time

Expert Contributor

I am using the ListFile processor to pick up input files for my flow, and it picks up all available files in the directory.

Is it possible to configure it to pick up only one file, let its processing complete, and then pick up the next file? In other words, the system should process only one file at a time.

Or can a processor other than ListFile be used for this purpose?

2 REPLIES

Super Guru

Hi @manishg ,

This has been asked before in a different form, but you can apply the same method:

https://community.cloudera.com/t5/Support-Questions/Wait-for-a-Flowfile-to-be-picked-only-after-the-...

If that helps, please accept the solution.

Thanks

Master Mentor

@manishg 

ListFile does not pick up any files. It simply generates a zero-content NiFi FlowFile for each file found in the target directory. That FlowFile only has metadata about the target content. The FetchFile processor uses that metadata to fetch the actual content and add it to the FlowFile. The value of this split shows when you have a lot of target files to ingest. To avoid having all the disk I/O related to that content on one node, you can redistribute the zero-byte FlowFiles across all nodes so that each node fetches content in a distributed way (this works assuming the same target directory is mounted on all NiFi cluster nodes).
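
To make the list/fetch split concrete, here is a rough Python sketch of the idea. It is only a toy model, not NiFi code: `list_files`, `fetch_file`, and `demo` are hypothetical names made up for illustration, and the dictionaries stand in for FlowFiles.

```python
import os
import tempfile


def list_files(directory):
    """Mimic ListFile: return metadata only, never read file content."""
    listing = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            listing.append(
                {"filename": name, "path": path, "size": os.path.getsize(path)}
            )
    return listing


def fetch_file(flowfile):
    """Mimic FetchFile: use the metadata to read the actual content."""
    with open(flowfile["path"], "rb") as handle:
        return dict(flowfile, content=handle.read())


def demo():
    """Create two sample files, list them, then fetch their content."""
    with tempfile.TemporaryDirectory() as directory:
        for name, data in (("a.txt", b"alpha"), ("b.txt", b"beta")):
            with open(os.path.join(directory, name), "wb") as handle:
                handle.write(data)
        listed = list_files(directory)
        fetched = [fetch_file(flowfile) for flowfile in listed]
    return listed, fetched
```

The listing step is cheap because it touches only file metadata. The expensive read happens in the fetch step, which is the part you would spread across nodes.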

As @SAMSAL shared, you could use Process Group (PG) FlowFile concurrency to process one FlowFile at a time.

ListFile will still continue to list all files in the target directory (it writes state and continues to list new files as they get added to the input directory). You can then feed the outbound connection of your ListFile to a PG configured with "Single FlowFile Per Node" FlowFile concurrency. This prevents any other FlowFile queued between ListFile and the PG from entering the PG until the first FlowFile has been processed through that PG.
So your first processor inside the PG would be your FetchFile processor. Now, if you were to configure a load-balanced connection between ListFile and the PG, you would end up with each node in your NiFi cluster processing a single file at a time. This gives you some concurrency if you want it. However, if you have a strict one-file-at-a-time requirement, you would not configure a load-balanced connection.
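
The gating behavior can be pictured with a small Python simulation. This is a conceptual sketch only: NiFi configures this in the PG settings rather than in code, and the class `SingleFlowFileGate` and function `run` are hypothetical names invented for this illustration of a single node.

```python
from collections import deque


class SingleFlowFileGate:
    """Toy model of a PG with "Single FlowFile Per Node" concurrency."""

    def __init__(self):
        self.queue = deque()   # connection between ListFile and the PG
        self.in_flight = None  # the one FlowFile currently inside the PG

    def enqueue(self, flowfile):
        """ListFile keeps adding FlowFiles to the inbound connection."""
        self.queue.append(flowfile)

    def try_admit(self):
        """Admit the next queued FlowFile only if the PG is empty."""
        if self.in_flight is None and self.queue:
            self.in_flight = self.queue.popleft()
            return self.in_flight
        return None

    def complete(self):
        """Mark the in-flight FlowFile as having left the PG."""
        finished, self.in_flight = self.in_flight, None
        return finished


def run(filenames):
    """Push all listed files through the gate and record the event order."""
    gate = SingleFlowFileGate()
    for name in filenames:
        gate.enqueue(name)
    events = []
    while gate.queue or gate.in_flight is not None:
        admitted = gate.try_admit()
        if admitted is not None:
            events.append(("admit", admitted))
        events.append(("done", gate.complete()))
    return events
```

Calling `run(["a.txt", "b.txt"])` yields admit and done events strictly interleaved per file, which is the one-at-a-time behavior described above. With a load-balanced connection, each node would effectively have its own gate like this one.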

Hope this helps,

Matt