I am using a GetHDFS processor with the CRON driven scheduling strategy, scheduled to run every day at 10 am.
I have one input file to read, but when the dataflow starts it gets the source file multiple times instead of once (9 times in my case). Why?
As a result, when I write the output of the dataflow, I get the following warning: file with same name already exists
Should I modify the Polling Interval parameter? (It is set to 0 sec by default.)
Try setting the cron run schedule to 0 0 10 * * ? instead.
The other cron schedule grabbed the same file multiple times because the * * in the seconds and minutes fields means run every second of every minute during that hour.
In practice, this means the processor runs as often as possible, using the allowable number of concurrent tasks, throughout the 10th hour of each day. In your case it sounds like it was able to run at least 9 times in that one hour.
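To make the difference between the two schedules concrete, here is a minimal Python sketch that counts how many seconds per day a schedule is eligible to fire. It is not how NiFi or Quartz evaluates cron expressions: the helper names field_matches and fires_per_day are hypothetical, only *, ? and single integers are handled, and the original schedule is assumed to have been * * 10 * * ? based on the description above.

```python
def field_matches(field: str, value: int) -> bool:
    """Return True if a simplified cron field accepts the value.

    Only '*', '?' and a single integer are supported here; real Quartz
    expressions also allow ranges, lists and step values.
    """
    return field in ("*", "?") or int(field) == value


def fires_per_day(expression: str) -> int:
    """Count the seconds in one day at which the schedule may fire.

    Only the first three fields (seconds, minutes, hours) are examined;
    the day, month and weekday fields are assumed to match every day.
    """
    seconds, minutes, hours = expression.split()[:3]
    return sum(
        1
        for h in range(24)
        for m in range(60)
        for s in range(60)
        if field_matches(hours, h)
        and field_matches(minutes, m)
        and field_matches(seconds, s)
    )


if __name__ == "__main__":
    print(fires_per_day("* * 10 * * ?"))  # every second of the 10 o'clock hour
    print(fires_per_day("0 0 10 * * ?"))  # exactly 10:00:00
```

Under these assumptions the first expression matches 3600 distinct seconds per day and the second matches exactly one. The processor will not necessarily complete 3600 runs, since each run takes time, but every one of those seconds is an opportunity to start again.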
If you are running a NiFi cluster, by default every node in your cluster will run this GetHDFS processor at 10 am each day. This means every node will get a copy of the same files and process them in the same way.
If you are running a cluster, consider changing the configuration of your GetHDFS processor so that it runs on the primary node only.