NiFi: Isolated Processors in a clustered

Hi,

I am unable to visualize how multiple NiFi nodes in a cluster process a flowfile.

In a NiFi cluster, the same dataflow runs on all the nodes. As a result, every component in the flow runs on every node.

Ref:

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html

Processors can then be scheduled to run on the Primary Node only, via an option on the scheduling tab of the processor which is only available in a cluster.

Ref:

https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

With the context described above, let's say the processor at the very top of the flow uses the "CRON driven" scheduling strategy and the rest are simply "Timer driven". Assume this flow doesn't involve any Kafka processors and there are 2 nodes in the cluster. If the top processor is scheduled to start at 12:00 AM, will it start on both NiFi nodes? If so, should we set Execution to "Primary node" for the top processor, to prevent it from being executed on both NiFi nodes in parallel?

In the second scenario, my top processor is ConsumeKafka_0_10 with the "Timer driven" scheduling strategy. If my topic is configured with 6 partitions and I set concurrent tasks to 3, would 3 processor instances run on one node and another 3 processor instances on the other node?

1 ACCEPTED SOLUTION

When to use "primary node only" depends on whether the operation is something that makes sense to happen on all nodes, or whether it's something that only makes sense to happen once. Here are some examples...

  • ListHDFS - this should be primary node only because otherwise you are going to perform the same listing on all nodes
  • ConsumeKafka - this can be run on all nodes because each one will be consuming different data
  • GetFile - this can be run on all nodes because each node will pick up different data from a local directory
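To make the ListHDFS case concrete, here is a minimal Python sketch that only simulates the idea; it does not call NiFi. The node names, the file paths, and the assumption that the first node in the list is the primary are all made up for illustration.

```python
from typing import Dict, List

# Hypothetical files sitting on shared storage that every node can see.
SHARED_LISTING = ["/data/a.csv", "/data/b.csv"]


def run_listing(nodes: List[str], primary_only: bool) -> Dict[str, List[str]]:
    """Simulate a listing processor running on each node of a cluster."""
    primary = nodes[0]  # assumption for this sketch: first node is the primary
    results: Dict[str, List[str]] = {}
    for node in nodes:
        if primary_only and node != primary:
            # The processor is not scheduled on this node at all.
            results[node] = []
        else:
            # Every node that runs the processor sees the same shared files.
            results[node] = list(SHARED_LISTING)
    return results


def count_flowfiles(results: Dict[str, List[str]]) -> int:
    """Total number of flowfiles produced across the cluster."""
    return sum(len(files) for files in results.values())


nodes = ["node-1", "node-2"]  # hypothetical node names
all_nodes = run_listing(nodes, primary_only=False)
primary = run_listing(nodes, primary_only=True)
```

With two nodes and execution on all nodes, each shared file is listed twice; with primary node only, each file is listed once. GetFile does not have this problem because each node reads its own local directory.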

In your Kafka scenario, the number of instances of a processor equals the number you see on the canvas multiplied by the number of nodes in the cluster. So if you have a two-node cluster with one ConsumeKafka_0_10 on the canvas, there are two instances of ConsumeKafka_0_10. If you increase concurrent tasks to 3, there are 3 threads executing each instance on each node, so 6 in total. Since you have 6 partitions, each of these 6 threads should consume from a separate partition.
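The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the round-robin spread is a simplification. In a real cluster Kafka's consumer group coordinator decides the assignment, so the exact partition-to-thread mapping may differ.

```python
from typing import List


def total_consumer_threads(nodes: int, concurrent_tasks: int) -> int:
    """One processor on the canvas runs on every node, with
    `concurrent_tasks` threads per node."""
    return nodes * concurrent_tasks


def spread_partitions(partitions: int, consumers: int) -> List[List[int]]:
    """Simplified round-robin spread of partition ids over consumer threads.
    Assumption: this stands in for Kafka's real assignment strategy."""
    assignment: List[List[int]] = [[] for _ in range(consumers)]
    for partition in range(partitions):
        assignment[partition % consumers].append(partition)
    return assignment


threads = total_consumer_threads(nodes=2, concurrent_tasks=3)
assignment = spread_partitions(partitions=6, consumers=threads)
idle = [a for a in spread_partitions(partitions=6, consumers=8) if not a]
```

With 2 nodes and 3 concurrent tasks there are 6 threads, and each gets exactly one of the 6 partitions. If the total number of threads exceeded the number of partitions (8 threads in the last line), the extra threads would have nothing to consume.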
