I am new to building data pipelines with Kafka and NiFi, and I'm trying to build a NiFi flow using a Kafka publisher and consumer. I have a particular doubt about how PublishKafka, topics, consumers and ConsumeKafka fit together.
I have 3 Kafka brokers running on 3 nodes, so I created a Kafka topic named "test01" on each node. Then, when I configure the PublishKafka processor in NiFi, I set the 3 broker hostnames and the topic name as follows:
* I check the available consumer IDs with the following shell line on one of the nodes:
./zookeeper-shell.sh hdf01.local:2181 ls /consumers
And everything works fine, but I still don't understand whether it's necessary to create the topic on all the nodes to parallelize, or whether creating it on just one would give the same result. Also, what's the difference between listing all the Kafka brokers in the processor properties versus just one?
When you create a topic there are two different concepts: partitions and replication.
If you have 3 brokers and create a topic with 1 partition, then the entire topic exists only on one of those brokers.
If you create a topic with 3 partitions, then 1/3 of the topic is on broker 1 as partition 1, 1/3 on broker 2 as partition 2, and 1/3 on broker 3 as partition 3.
If you create a topic with 3 partitions AND a replication factor of 2, then it's the same as above except there is also a copy of each partition on another node. So partition 1 may be on broker 1 with a copy on broker 2, partition 2 may be on broker 2 with a copy on broker 3, and partition 3 may be on broker 3 with a copy on broker 1.
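As a rough sketch of the setup above, assuming Kafka's standard CLI tools are in the current directory and ZooKeeper is reachable at hdf01.local:2181 (as in the question), a topic like this could be created and inspected as follows. Note the exact flags vary by Kafka version; newer releases take --bootstrap-server with a broker address instead of --zookeeper:

```shell
# Create "test01" once for the whole cluster:
# 3 partitions, each with copies on 2 brokers
./kafka-topics.sh --create --zookeeper hdf01.local:2181 \
    --topic test01 --partitions 3 --replication-factor 2

# Show which broker leads each partition and where the replicas live
./kafka-topics.sh --describe --zookeeper hdf01.local:2181 --topic test01
```

The --describe output lists, per partition, its leader broker and the replica set, which makes the partition/replica layout described above directly visible.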
In general, replication ensures that if a broker goes down another broker still has the data, and partitioning allows for higher read/write throughput by dividing the data across multiple nodes. Note that a topic is created once for the whole cluster, not per node: the brokers coordinate the partition placement among themselves, so you don't need to create the topic on each broker separately.