Kafka topics and consumers configuration using NiFi


I am new to building data pipelines with Kafka and NiFi. I'm testing a NiFi flow that uses a Kafka publisher and consumer, and I have a question about how PublishKafka, topics, consumers, and ConsumeKafka fit together.

I have 3 Kafka brokers running on 3 nodes, so I created one Kafka topic on each node with the name "test01". Then, when I configure the PublishKafka processor in NiFi, I set the 3 broker hostnames and the topic name as follows:

- Kafka Brokers: hdf01.local:6667, hdf02.local:6667, hdf03.local:6667

- Topic Name: test01

This works fine. I can check the console consumer over ssh, and it shows the data from the flowfiles:

./kafka-console-consumer.sh --zookeeper hdf01.local:2181 --topic test01

Then, when I configure the ConsumeKafka processor in NiFi, I set the properties as:

- Kafka Brokers: hdf01.local:6667, hdf02.local:6667, hdf03.local:6667

- Topic Name: test01

- Group ID: 91802*

* I checked the available consumer group IDs by running this command over ssh on one of the nodes:

./zookeeper-shell.sh hdf01.local:2181 ls /consumers

Everything works fine, but I still don't understand whether it's necessary to create the topic on every node to parallelize, or whether creating it just once would give the same result. Also, what's the difference between listing all Kafka brokers in the properties and listing just one?

Thank you all in advance!

1 REPLY


When you create a topic, there are two separate concepts to consider: partitions and replication.

If you have 3 brokers and create a topic with 1 partition, then the entire topic exists only on one of those brokers.

If you create a topic with 3 partitions, then 1/3 of the topic is on broker 1 as partition 1, 1/3 on broker 2 as partition 2, and 1/3 on broker 3 as partition 3.

If you create a topic with 3 partitions AND a replication factor of 2, then it's the same as above, except there is also a copy of each partition on another node. So partition 1 may be on broker 1 with a copy on broker 2, partition 2 may be on broker 2 with a copy on broker 3, and partition 3 may be on broker 3 with a copy on broker 1.
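The layout described above can be sketched as a small Python model. This is only an illustration of the idea, not Kafka's actual assignment algorithm (the real one also randomizes the starting broker); the function name `assign_replicas` and the broker labels are made up for this example.

```python
def assign_replicas(num_partitions, replication_factor, brokers):
    """Simplified round-robin placement (an assumption, not Kafka's real code).

    Partition i is placed on broker i, and each extra copy goes to the
    next broker in the list, wrapping around at the end.
    """
    if replication_factor > len(brokers):
        # Kafka also refuses a replication factor larger than the broker count.
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        assignment[p + 1] = [
            brokers[(p + r) % len(brokers)] for r in range(replication_factor)
        ]
    return assignment


brokers = ["broker1", "broker2", "broker3"]  # hypothetical labels
layout = assign_replicas(3, 2, brokers)
for partition, replicas in layout.items():
    print(f"partition {partition}: {replicas}")
```

With 3 partitions and a replication factor of 2, the model reproduces the example: each partition lives on one broker with a copy on the next one.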

In general, replication ensures that if a broker goes down, another broker still has the data, and partitioning allows for higher read/write throughput by dividing the data across multiple nodes.
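To make the replication point concrete, here is a minimal Python sketch that checks which partitions would have no surviving copy when brokers fail. The two layouts and the helper name `unavailable_partitions` are assumptions made for illustration; they model the example in this answer, not a real cluster.

```python
def unavailable_partitions(layout, failed_brokers):
    """Return the partitions whose every replica sits on a failed broker."""
    return sorted(
        partition
        for partition, replicas in layout.items()
        if all(broker in failed_brokers for broker in replicas)
    )


# Hypothetical layouts: 3 partitions with replication factor 2, and with 1.
replicated = {
    1: ["broker1", "broker2"],
    2: ["broker2", "broker3"],
    3: ["broker3", "broker1"],
}
single_copy = {1: ["broker1"], 2: ["broker2"], 3: ["broker3"]}

print(unavailable_partitions(replicated, {"broker1"}))
print(unavailable_partitions(single_copy, {"broker1"}))
```

In this model, losing one broker leaves every partition of the replicated topic reachable, while the single-copy topic loses the partition that lived on the failed broker.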