
Is offset of kafka lost on nifi failure?


I just saw a video with a rather weird claim. It stated that if NiFi fails, the offset of the Kafka record reader will be lost. This seems a bit weird to me, since the offset should be kept by Kafka for the specific consumer group ID, AFAIK.

Am I missing something?

1 ACCEPTED SOLUTION

Super Guru

You're correct, @MartinTerreni.

If you set the Consumer Group Id property for the processor that consumes from Kafka, the offset will be maintained across NiFi crashes and restarts.

Note, though, that the Consumer Group Id in NiFi is specified per processor, not per record reader.
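
For illustration, the ConsumeKafka processors rely on the standard Kafka consumer-group mechanism, so the behaviour is the same as in this plain-Java sketch (this is not NiFi's code; the topic name, group id and broker address are placeholders): as long as the group id stays the same, a restarted process resumes from the last committed offset.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupOffsetDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Plays the same role as the Consumer Group Id property on the processor.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-nifi-flow");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // Only used when the group has no committed offset yet.
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("source-topic"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
                // The commit is stored broker-side, keyed by the group id, so a new
                // process using the same group id resumes from this point.
                consumer.commitSync();
            }
        }
    }

If the group id changes, there is no committed offset for the new group, and the auto.offset.reset behaviour decides where consumption starts instead.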

 

Would you please share the link to the video?

Cheers,

Andre

--
Was your question answered? Please take some time to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs up button.


4 REPLIES


Super Guru

Thanks for the link to the video, @MartinTerreni.

In a traditional NiFi flow that reads from Kafka and writes to Kafka, the offsets of the source topic are indeed stored by the consumer in Kafka itself. Because of this, if NiFi crashes or restarts, the flow continues reading the source topic from the last offset committed to Kafka.
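
You can check that those offsets live in the Kafka cluster, not in anything NiFi keeps locally, by asking the brokers directly. A small sketch with the Kafka Java AdminClient (the group id and broker address are placeholders):

    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class ShowCommittedOffsets {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // The committed offsets are fetched from the brokers for the given group id.
                Map<TopicPartition, OffsetAndMetadata> offsets =
                        admin.listConsumerGroupOffsets("my-nifi-flow")
                             .partitionsToOffsetAndMetadata()
                             .get();
                offsets.forEach((tp, om) ->
                        System.out.printf("%s -> committed offset %d%n", tp, om.offset()));
            }
        }
    }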

 

The problem, which @mpayne explains in the video, is that in a traditional flow the consumer and the producer are decoupled, and this can cause data loss or data duplication, depending on the scenario.

 

For example, the ConsumeKafka processor commits offsets to the source Kafka cluster in batches at regular intervals. It is possible that some messages read by the consumer are written by the producer to the destination topic before the offsets of those records are committed to the source cluster. If the flow stops abruptly before the commit happens, then when it starts again it will resume from the previously committed offset and write duplicate messages to the target topic.

 

On the other hand, since there is no tight coupling between consumer and producer, the consumer could read some messages and commit their offsets to the source cluster before the PublishKafka processor is able to deliver those messages to the target cluster. If there is a crash before those messages are sent and some of the flowfiles are lost (e.g. one node of the cluster burns down), that data will never be read again by the consumer and there will be a data gap at the destination.
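
NiFi is of course not implemented as a single loop, but a stripped-down consume-then-produce bridge in plain Java shows where the two windows described above sit (topics, group id and broker address are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class NaiveKafkaBridge {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "bridge-group");
            cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            Properties pProps = new Properties();
            pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {

                consumer.subscribe(Collections.singletonList("source-topic"));

                while (true) {
                    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));

                    for (ConsumerRecord<String, String> r : batch) {
                        // Duplication window: if these sends reach the target topic but the
                        // process dies before commitSync() below, the next run re-reads the
                        // same offsets and produces the records a second time.
                        producer.send(new ProducerRecord<>("target-topic", r.key(), r.value()));
                    }

                    // Data-loss window: if the offsets were committed before the sends above
                    // are actually delivered (or while the data sits in an in-flight queue
                    // that is lost with a node), those records are never read again and
                    // never reach the target topic.
                    consumer.commitSync();
                }
            }
        }
    }

In a NiFi flow the two halves are separate processors connected by flowfile queues, so the offset commit on the source side and the delivery to the target topic are never atomic, which is exactly what opens these windows.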

 

The new feature explained in the video addresses all of these issues to guarantee exactly-once delivery semantics from Kafka to Kafka.
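
I can't speak for the exact implementation details, but the Kafka building block for exactly-once Kafka-to-Kafka is a transactional producer that writes the output records and commits the source offsets in a single atomic transaction. A minimal plain-Java sketch of that pattern (names are placeholders, abort/error handling omitted):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalBridge {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "bridge-group");
            cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
            cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            Properties pProps = new Properties();
            pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "bridge-tx-1");
            pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {

                producer.initTransactions();
                consumer.subscribe(Collections.singletonList("source-topic"));

                while (true) {
                    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                    if (batch.isEmpty()) {
                        continue;
                    }

                    producer.beginTransaction();
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : batch) {
                        producer.send(new ProducerRecord<>("target-topic", r.key(), r.value()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                    new OffsetAndMetadata(r.offset() + 1));
                    }

                    // The produced records and the source offsets commit atomically:
                    // either both become visible or neither does, so a crash can cause
                    // neither duplicates nor gaps on the target topic.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                }
            }
        }
    }

A real implementation also has to abort the transaction and rewind the consumer when something fails mid-batch; that part is left out of the sketch.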

 

I hope this helps clarify it a bit more 🙂

Cheers,

André

 



Very good clarification, thank you!