
NiFi cluster reading duplicate data from Kafka



We have a NiFi cluster running with 6 nodes. When I add a Kafka consumer, each cluster node should pull unique data; that is, each node should fetch from a different partition.

The same is also stated in the NiFi docs. In our case, however, every node pulls the same data from Kafka, which leads to duplication. Can you please help? Is any specific configuration required to get each node to consume distinct data?



@siddharth pande

The default behavior is the one you described at the beginning, with each node consuming from a different partition.
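To make that default concrete, here is a minimal sketch in plain Python (no Kafka client involved) of how partitions are shared among consumers. It assumes a simplified round-robin assignor; the names `assign_partitions`, `simulate`, `nifi-group`, and `node1`..`node6` are hypothetical and only for illustration. The point it shows is a general Kafka property: consumers that share a group ID split the partitions between them, while consumers with different group IDs each receive every partition. Whether that is what is happening in this cluster is only a guess until the processor configuration is posted.

```python
from collections import defaultdict


def assign_partitions(partitions, members):
    """Simplified round-robin assignment within ONE consumer group.

    This is an illustration, not Kafka's actual assignor implementation.
    """
    ordered = sorted(members)
    assignment = {m: [] for m in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


def simulate(partitions, node_to_group):
    """Return {node: [partitions]} given the group ID used by each node."""
    groups = defaultdict(list)
    for node, group in node_to_group.items():
        groups[group].append(node)
    result = {}
    # Each group independently receives ALL partitions of the topic.
    for nodes in groups.values():
        result.update(assign_partitions(partitions, nodes))
    return result


partitions = list(range(6))                      # hypothetical 6-partition topic
nodes = [f"node{i}" for i in range(1, 7)]        # hypothetical 6 NiFi nodes

# All nodes share one group ID: every partition is read by exactly one node.
shared = simulate(partitions, {n: "nifi-group" for n in nodes})

# Every node ends up in its own group: each node reads every partition.
separate = simulate(partitions, {n: f"group-{n}" for n in nodes})

print(shared)
print(separate)
```

If the nodes behave like the `separate` case, the Group ID setting of the processor is the first thing worth checking.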

You should share the processor's configuration and the output of a describe on the topic.

Check also that the ConsumeKafka processor is compatible with the version of Kafka you are using.


@Rafeeq Shanavaz: My NiFi version is 1.7.1 and the ConsumeKafka version is, not sure whether they are compatible or not. Could this be the issue?


The version is not the issue: I got the same problem when using ConsumeKafka 1.7.1 with NiFi 1.7.1.


Which version of Kafka itself are you using (not the ConsumeKafka processor)?
Can you also post the configuration?

Cloudera Employee

Hi @siddharth pande -

Can you also expand on the type of duplication? For example, do 4 concurrent tasks result in 4x duplication of the originating data on Kafka? Alternatively, is one partition duplicated while the others are not? Does the issue happen all of the time? Are we sure the duplication isn't happening before the data gets into Kafka (say, a producer sending duplicate data)?
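One way to answer those questions is to look at the partition and offset of each consumed record. Below is a rough sketch in plain Python, assuming you can export each record as a `(partition, offset, payload)` tuple (for instance from the Kafka partition and offset attributes on the FlowFiles, if your processor version writes them). The function name `classify_duplicates` and the sample data are hypothetical. The same partition and offset seen more than once suggests the duplication is on the consumer side; the same payload at different offsets suggests it was already duplicated when it was produced.

```python
from collections import Counter, defaultdict


def classify_duplicates(records):
    """Split duplicates into consumer-side and producer-side candidates.

    records: iterable of (partition, offset, payload) tuples.
    Returns (consumer_side, producer_side) where
      consumer_side maps (partition, offset) -> times it was consumed (> 1)
      producer_side maps payload -> sorted list of distinct positions (> 1)
    """
    records = list(records)  # allow generators; we iterate twice

    # Same position read more than once: the consumers re-read the record.
    position_counts = Counter((p, o) for p, o, _ in records)
    consumer_side = {pos: n for pos, n in position_counts.items() if n > 1}

    # Same payload stored at several positions: duplicated before/at produce.
    positions_by_payload = defaultdict(set)
    for p, o, payload in records:
        positions_by_payload[payload].add((p, o))
    producer_side = {
        payload: sorted(positions)
        for payload, positions in positions_by_payload.items()
        if len(positions) > 1
    }
    return consumer_side, producer_side


# Hypothetical sample: "a" was consumed twice from the same offset,
# "b" exists at two different offsets, "c" is unique.
sample = [
    (0, 10, "a"),
    (0, 10, "a"),
    (1, 5, "b"),
    (1, 6, "b"),
    (2, 7, "c"),
]

consumer_side, producer_side = classify_duplicates(sample)
print(consumer_side)
print(producer_side)
```

Note that identical payloads at different offsets can also be legitimate data, so treat the second result as a hint rather than proof.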

Per Raffaele's suggestion, please send over the configuration of the Kafka processor within NiFi. Also, if you have fake data (or a schema that we could follow) with which you can reliably reproduce the duplication by publishing to a topic, please share it; I'd like to try to reproduce the problem.