Created 04-20-2017 02:39 PM
I am using a GetHDFS processor with the CRON driven scheduling strategy, scheduled to run every day at 10 am.
I have one input file to read, but when the dataflow starts it gets the source file multiple times instead of once (9 times in my case). Why?
As a result, when I write the output of the dataflow, I get the following warning: file with same name already exists
Should I modify the Polling Interval parameter? (It is set to 0 sec by default.)
Created 04-20-2017 02:44 PM
What does your cron run schedule look like?
Created 04-20-2017 02:47 PM
Run schedule : * * 10 * * ?
Created 04-20-2017 03:16 PM
Try setting the cron run schedule to 0 0 10 * * ? instead.
The other cron schedule grabbed the same file multiple times because the * * in the seconds and minutes fields means "run every second of every minute" during that hour.
Created 04-20-2017 03:18 PM
More precisely, * * gives the processor the possibility to run every second of every minute. In practice this means it runs as often as possible, using the allowable number of concurrent tasks, between 10:00 and 10:59 each day. In your case it sounds like it was able to run 9 times in that one hour.
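To illustrate the difference between the two schedules, here is a minimal Python sketch that counts how many second-level trigger times each expression allows in one day. It is not NiFi or Quartz code: `field_matches` and `triggers_per_day` are hypothetical helpers, and they only understand `*`, `?` and single integers in the seconds, minutes and hours fields. The count is an upper bound on eligible trigger times, not the number of times the processor actually ran.

```python
def field_matches(field: str, value: int) -> bool:
    """Return True if a simplified cron field ('*', '?' or one integer) matches value."""
    return field in ("*", "?") or int(field) == value


def triggers_per_day(expression: str) -> int:
    """Count eligible trigger times in one day for a simplified cron expression.

    Assumption: only the first three fields (seconds, minutes, hours) are
    inspected; the day, month and weekday fields are ignored.
    """
    sec, minute, hour = expression.split()[:3]
    return sum(
        1
        for h in range(24)
        for m in range(60)
        for s in range(60)
        if field_matches(hour, h)
        and field_matches(minute, m)
        and field_matches(sec, s)
    )


print(triggers_per_day("* * 10 * * ?"))  # 3600: every second from 10:00:00 to 10:59:59
print(triggers_per_day("0 0 10 * * ?"))  # 1: only 10:00:00
```

With `* * 10 * * ?` there are 3600 eligible trigger times in the hour, so the file can be picked up again each time a run completes. With `0 0 10 * * ?` there is exactly one.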
Created 04-20-2017 02:56 PM
Hi @Raphaël MARY,
Did you set a different value for number of concurrent tasks?
Are you in a cluster configuration?
Created 04-20-2017 03:03 PM
No, only one node and 1 concurrent task.
I changed to 0 0 10 * * ? in order to specify minutes and seconds.
It is working now!
Created 04-20-2017 03:04 PM
If you are running a NiFi cluster, by default every node in your cluster will run this GetHDFS processor at 10 am each day. This means every node will get a copy of the same files and process them in the same way.
If you are running a cluster, consider changing the configuration of your GetHDFS processor so that it runs on the primary node only.