Created 04-27-2017 11:22 AM
Hi,
I am unable to visualize how multiple NiFi nodes in a cluster process a flowfile.
In a NiFi cluster, the same dataflow runs on all the nodes. As a result, every component in the flow runs on every node.
Ref:
https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
Processors can then be scheduled to run on the Primary Node only, via an option on the scheduling tab of the processor which is only available in a cluster.
Ref:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
With the context described above, let's say the processor at the very top of the flow uses the "CRON driven" schedule type and the rest are simply "Timer driven". Assume the flow doesn't involve any Kafka processors and there are 2 nodes in the cluster. If the top processor is scheduled to start at 12:00 AM, will it start on both NiFi nodes? If so, should we set Execution to "Primary node" for the top processor, to prevent it from executing on both NiFi nodes in parallel?
In the second scenario, suppose my top processor is ConsumeKafka_0_10 with the "Timer driven" schedule type, my topic is configured with 6 partitions, and I set the concurrency to 3. Would 3 processor instances run on one node and another 3 processor instances on the other node?
Created 04-27-2017 12:41 PM
When to use "primary node only" depends on whether the operation is something that makes sense to happen on all nodes, or whether it's something that only makes sense to happen once. Here are some examples...
In your Kafka scenario, the number of instances of a processor equals what you see on the graph times the number of nodes in the cluster. So if you have a two-node cluster with one ConsumeKafka_0_10 on the canvas, there are two instances of ConsumeKafka_0_10. If you increase concurrent tasks to 3, there are 3 threads executing each instance on each node, so 6 in total. Since you have 6 partitions, each of these 6 threads should consume from a separate partition.
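To make the arithmetic above concrete, here is a minimal Python sketch. It is only an illustration: the function names are hypothetical, and the round-robin assignment is an assumption for demonstration. In a real cluster the Kafka consumer group decides which consumer gets which partition, not NiFi and not this code.

```python
def total_consumer_threads(nodes: int, concurrent_tasks: int) -> int:
    """Threads consuming across the cluster: one processor instance per node,
    each running `concurrent_tasks` threads."""
    return nodes * concurrent_tasks


def assign_partitions(partitions: int, nodes: int, concurrent_tasks: int) -> dict:
    """Spread partitions over (node, task) threads round-robin.

    Assumption: round-robin is used here only for illustration; the actual
    assignment is made by the Kafka consumer group.
    """
    threads = [(n, t) for n in range(nodes) for t in range(concurrent_tasks)]
    assignment = {thread: [] for thread in threads}
    for partition in range(partitions):
        assignment[threads[partition % len(threads)]].append(partition)
    return assignment


if __name__ == "__main__":
    # Two nodes, 3 concurrent tasks, 6 partitions: the scenario in the question.
    print(total_consumer_threads(2, 3))
    for thread, parts in assign_partitions(6, 2, 3).items():
        print(f"node {thread[0]}, task {thread[1]} -> partitions {parts}")
```

With 2 nodes, 3 concurrent tasks and 6 partitions, every thread ends up with exactly one partition. If the topic had fewer partitions than threads (say 4), some threads would have nothing to consume, since a partition is read by at most one consumer in a group.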