We have a cluster running with 6 nodes. When I add a Kafka consumer, each cluster node should pull unique data, i.e. each node should fetch from a different partition: https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka.
The same is also mentioned in the NiFi docs. However, in our case each node is pulling the same data from Kafka, leading to duplication. Can you please help? Are there any specific configurations required to achieve this?
The default behavior is the one you described at the beginning: each node consumes from a different partition.
You should share the processor's configuration and the output of a describe for the topic (e.g. `kafka-topics.sh --describe`).
Also check that the ConsumeKafka processor you chose is compatible with the version of Kafka you are running.
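To make the expected behavior concrete, here is a toy sketch (not NiFi or Kafka code; all node and partition names are illustrative) of how a Kafka group coordinator spreads partitions across consumers that share one `group.id`, range-assignor style. The key point for this thread: NiFi nodes sharing a single group id each receive distinct partitions, while giving each node its own group id makes every node read every record, which looks exactly like duplication.

```python
# Toy model of consumer-group partition assignment (range-style).
# Illustrative only -- real assignment is done by the Kafka broker/coordinator.

def assign_partitions(consumers, num_partitions):
    """Split partitions contiguously across the consumers of ONE group."""
    consumers = sorted(consumers)
    per, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment

nodes = [f"node-{i}" for i in range(6)]

# Six NiFi nodes sharing ONE group.id over a 6-partition topic:
# each node owns exactly one distinct partition -> no duplication.
shared_group = assign_partitions(nodes, 6)

# If each node uses its OWN group.id, every "group" has a single member
# and is assigned ALL partitions -> every node reads every record.
separate_groups = {n: assign_partitions([n], 6)[n] for n in nodes}
```

So the first thing to verify in the ConsumeKafka configuration is that the Group ID property is identical on all 6 nodes.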
Hi @siddharth pande -
Can you also expand on the type of duplication? Do 4 concurrent tasks equate to 4x duplication of the originating data on Kafka? Alternatively, is one partition duplicated while the others are not? Does the issue happen all of the time? Are we sure the duplication isn't happening before the data gets into Kafka (say, a producer sending duplicate records)?
Per Raffaele's suggestion, please send over the configuration of the Kafka processor within NiFi. Also, if you have fake data (or a schema we could follow to reproduce it easily), please share it along with the steps you use to publish to a topic and trigger the duplication; I'd like to try to reproduce it.
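While gathering that information, the standard Kafka CLI tools can show directly whether the nodes are sharing a consumer group. These commands run against a live broker, so the bootstrap address, topic name, and group id below are placeholders you would replace with your own (older Kafka releases use `--zookeeper` instead of `--bootstrap-server`):

```shell
# Show partition count and replica layout for the topic:
kafka-topics.sh --bootstrap-server broker:9092 --describe --topic my-topic

# Show which consumer instance owns each partition, plus per-partition lag.
# Six NiFi nodes sharing one group.id should each own distinct partitions;
# if the group shows only one member (or doesn't exist), the nodes are not
# coordinating through a common group.
kafka-consumer-groups.sh --bootstrap-server broker:9092 --describe --group my-nifi-group
```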