What do the Polling Interval, Max Select, and Run Schedule attributes do in the GetSFTP processor in NiFi?


I am a little confused about what each of these does. Can someone here help me with this?

For example:

I have set the following attributes:

Max Select: 2

Run Schedule: 30 sec

Polling Interval: 0 sec

In the source directory I have many files (say 10,000), and I am writing these files to HDFS.

What would be the output/expected behavior?

3 REPLIES

@Pradhuman Gupta

Max Select - the maximum number of files pulled in a single connection. In your example, each run will get two files, multiplied by the number of concurrent tasks.

Run Schedule - the amount of time to wait between runs of the pulling task. In your example, the processor will pull files every 30 seconds.

Polling Interval - how long to wait between getting listings of new files.

FYI, we refer to these as properties.

So for the example above, the processor will run the first time, get a listing of the 10,000 files, and pull two of them. It will then wait 30 seconds, pull two more files, and so on. The processor will have to run 5,000 times to pull the 10,000 files. With a 30 second wait between tasks, that is 4,999 waits x 30 seconds = 149,970 seconds, or about 41.66 hours, to pull all of the files. That works out to about 4 files/minute, or 20 files/5 minutes.

If you don't write any new files to the directory, then the polling interval could be set even higher. Also, the listing needs a concurrent task and the pulling needs a concurrent task, so I would give the processor at least 2 concurrent tasks and reduce the time on the run schedule. Consider increasing Max Selects, at least to the default of 100, because that will be more efficient and faster.
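To make the estimate easier to redo with other settings, here is a rough back-of-the-envelope sketch in Python. It is not NiFi code and does not call any NiFi API. The function name `estimate_pull_time` and its parameters are made up for illustration. It assumes every run pulls exactly the Max Selects number of files per concurrent task, and it ignores listing time and transfer time, so real numbers will differ.

```python
import math


def estimate_pull_time(total_files, max_selects, run_schedule_s, concurrent_tasks=1):
    """Rough estimate of how many runs and how much waiting a backlog needs.

    Hypothetical helper: assumes each run pulls max_selects files per
    concurrent task and that only the Run Schedule wait costs time.
    """
    files_per_run = max_selects * concurrent_tasks
    runs = math.ceil(total_files / files_per_run)
    # There is one wait between each pair of consecutive runs.
    wait_s = (runs - 1) * run_schedule_s
    return runs, wait_s


# Settings from the question: Max Select 2, Run Schedule 30 sec, 10,000 files.
runs, wait_s = estimate_pull_time(10_000, 2, 30)
print(runs, wait_s, round(wait_s / 3600, 2))  # 5000 149970 41.66

# Same backlog with Max Selects at 100.
runs, wait_s = estimate_pull_time(10_000, 100, 30)
print(runs, wait_s, round(wait_s / 60, 1))  # 100 2970 49.5
```

Under these assumptions, raising Max Selects from 2 to 100 cuts the waiting time from roughly 41.66 hours to about 49.5 minutes, before any change to the run schedule.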

Is there a reason you are pulling only four files per minute?

@Pradhuman Gupta

Did this answer your questions or are you still unclear?

New Contributor

Sorry, is there a way to disable the polling option?

For example, if I want to load only 3 specific files, and then load more only on the next rerun of the job, how can I disable the polling function so that the processor is not always waiting or listening?

Regards,

Daniele