Thanks Matt. "Nodes in a Nifi cluster are not aware of each other." Is this true for concurrent tasks (running on the same node) as well? If so, then increasing Concurrent Tasks to anything greater than 1 is always risky for any processor in any flow: a processor executing multiple threads could hit a race condition, generate duplicate flow files, or process the same flow file multiple times. For example, suppose I run something as simple as ListenTCP -> ConvertJSONToAvro -> MergeContent -> PutFile on a single-node cluster, and keep Concurrent Tasks at 1 for every processor except ConvertJSONToAvro, which I set to 2. Could ConvertJSONToAvro then process a single incoming JSON twice and generate 2 records? Or am I missing something?
Hi, can someone please explain NiFi's behaviour in the following scenario? The cluster has 4 nodes. A GetFile processor polls a shared folder containing thousands of files every minute and runs with 2 concurrent tasks. This translates to 8 running threads (as shown in the following image). Since all the running instances are reading from a single shared folder, is it possible for multiple threads to pick up the same file, causing duplicate flow files? I know that setting "Keep Source File" to false avoids this, but what happens when it is set to true? Is there a feature in the NiFi framework that safeguards against such scenarios (where a processor reads a shared resource from multiple threads across nodes)? Or is it something developers have to handle themselves when writing a custom processor (let's say someone is writing a custom GetDropBoxFile processor)? When writing a custom processor, how can we ensure we don't end up with data duplication or a race condition? Does ZooKeeper (ZK) play any role in such scenarios (maintaining global state of what needs to be processed, who will process it, and who is already processing it)? Thanks
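To make the concern concrete, here is a minimal sketch in plain Java. It does not use any NiFi API and does not claim to reproduce how GetFile is actually implemented; the class `SharedFolderPollers`, the method `pollOnce`, and the in-memory "folder" are all hypothetical names invented for illustration. It assumes each of the 8 threads lists the shared folder independently with no shared bookkeeping, and that removing a file acts as an atomic claim.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the scenario; not NiFi code.
final class SharedFolderPollers {

    // Runs one polling round and returns the total number of "flow files" created.
    static int pollOnce(int nodes, int tasksPerNode, int fileCount, boolean keepSourceFile) {
        // The shared folder, modelled as a thread-safe set of file names.
        Set<String> folder = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < fileCount; i++) {
            folder.add("file-" + i);
        }

        int pollers = nodes * tasksPerNode; // e.g. 4 nodes x 2 tasks = 8 threads
        AtomicInteger created = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(pollers);

        for (int p = 0; p < pollers; p++) {
            pool.execute(() -> {
                // Assumption: every poller lists the folder on its own.
                List<String> listing = new ArrayList<>(folder);
                for (String name : listing) {
                    if (keepSourceFile) {
                        // Nothing marks the file as taken, so every poller ingests it.
                        created.incrementAndGet();
                    } else if (folder.remove(name)) {
                        // remove() succeeds for exactly one poller, acting as the claim.
                        created.incrementAndGet();
                    }
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pollers", e);
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println("keep source file = true : " + pollOnce(4, 2, 5, true));
        System.out.println("keep source file = false: " + pollOnce(4, 2, 5, false));
    }
}
```

Under these assumptions, 5 files and 8 pollers produce 40 flow files when the source files are kept and 5 when each file is removed as it is claimed. Whether NiFi adds any coordination on top of this is exactly what I am asking.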