I am a little confused about what each of these means. Can someone here help me with this?
I have set the following attributes:
Run Schedule: 30 sec
Polling Interval: 0 sec
In the source directory I have many files (say 10000), and I am writing these files to HDFS.
What would be the output/expected behavior?
Max Selects - the maximum number of files pulled in a single connection. In your example, it will get two files each time it runs, multiplied by the number of concurrent tasks.
Run Schedule - the amount of time to wait between tasks that pull files. In your example, the processor will pull files every 30 seconds.
Polling Interval - how long to wait between getting listings of new files
FYI, we refer to these as properties.
So for the example above, the processor will run the first time, get a listing of the 10,000 files, and pull two of them. Then it will wait 30 seconds, pull two more files, and so on. The processor has to run 5,000 times to pull the 10,000 files; with a 30-second wait between tasks, that is 4,999 x 30 = 149,970 seconds, or about 41.66 hours, to pull all of the files. That works out to about 4 files/minute, or 20 files every 5 minutes.
If you don't write any new files to the directory, then the polling interval could be set even higher. Also, the listing needs a concurrent task and the pulling needs a concurrent task, so I would give the processor at least 2 concurrent tasks and reduce the time on the run schedule. Consider increasing Max Selects, at least to the default of 100, because that will be more efficient and faster.
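If you want to check the arithmetic for other settings, here is a rough sketch in Python. It is only a simplified model of the reasoning above, not NiFi code: the function name and parameters are made up for illustration, and it assumes every run pulls a full batch and that the pull itself takes no time.

```python
import math

def drain_time_seconds(total_files, max_selects, run_schedule_s, concurrent_tasks=1):
    """Estimate runs and total wait time to pull all files (simplified model).

    Assumes each run pulls max_selects * concurrent_tasks files and that
    the only time spent is the run schedule wait between consecutive runs.
    """
    files_per_run = max_selects * concurrent_tasks
    runs = math.ceil(total_files / files_per_run)
    # There is one wait between each pair of consecutive runs.
    return runs, (runs - 1) * run_schedule_s

# Settings from the example: 10,000 files, two files per run, 30 sec schedule.
runs, seconds = drain_time_seconds(10_000, 2, 30)
print(runs, seconds, round(seconds / 3600, 2))  # 5000 149970 41.66

# Same schedule with 100 files per run.
runs, seconds = drain_time_seconds(10_000, 100, 30)
print(runs, seconds, round(seconds / 60, 1))  # 100 2970 49.5
```

Under the same assumptions, pulling 100 files per run brings the total down from about 41.66 hours to about 49.5 minutes.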
Is there a reason you are pulling only four files per minute?