Created on 03-15-2016 09:10 AM - edited 09-16-2022 03:09 AM
I am trying to use a Kafka direct stream to process data in Spark Streaming in a reliable way. I understand that, with checkpointing, if anything happens to the driver the job will automatically pick up from where it left off. So far so good.
However, it is not clear to me how I am supposed to handle code changes in such an approach (e.g. to calculate additional outputs from the incoming data). Any changes to the streaming logic are ignored as long as checkpoints exist, so once I ship this streaming job I can never change it without throwing away the checkpoints. Throwing away the checkpoints forces the job either to reprocess the entire Kafka topic (earliest) or to start at the end of the topic, possibly skipping over unprocessed data (latest). Either way, not very reliable.
Am I missing something here? What is the right architecture for building a reliable kafka streaming job that allows for code updates? Am I supposed to keep track of the offsets myself and not use checkpoints? If so, is there any sort of example out there on how to do this?
Created 03-17-2016 08:30 AM
Yeah checkpoints aren't great for that. I honestly don't rely on them.
There are examples of how to save offsets yourself in https://github.com/koeninger/kafka-exactly-once/
specifically
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerP...
The newer Kafka consumer will allow committing offsets to Kafka (which still isn't transactional, but should be better than ZooKeeper in most ways). There's work towards making the new consumer work with Spark at
https://issues.apache.org/jira/browse/SPARK-12177
and
https://github.com/koeninger/spark-1/tree/kafka-0.9/external/kafka-beta/src/main/scala/org/apache/sp...
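The basic shape of those examples: start the direct stream from offsets you load out of your own store, and save the offset ranges alongside each batch's results, ideally in the same transaction. Here's a rough sketch against the Spark 1.x / Kafka 0.8 direct API; the loadOffsets/saveOffsets helpers and the broker address are placeholders, not code from the repo:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object ManualOffsetTracking {
  // Placeholder helpers: back these with your own store
  // (e.g. a database table keyed by topic and partition).
  def loadOffsets(): Map[TopicAndPartition, Long] = ???
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("manual-offsets"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Start the direct stream from the offsets you stored yourself, not from a checkpoint.
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, loadOffsets(),
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... compute and write this batch's results here ...
      // Save the offsets in the same transaction as the results if you want exactly-once output.
      saveOffsets(ranges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```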
Created 03-17-2016 09:14 AM
Thanks for the response.
So, basically listing my options:
1. Implement direct streams and track the offsets myself, with extra code based on the example you linked.
2. Use the old-style Kafka receiver and wait for it to be upgraded to the new consumer API in a future Spark version.
Either way avoid using checkpoints.
I guess I'll go with the receiver-style Kafka stream for now. It works and it's the simplest to use.
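For reference, the receiver-based version I'm planning to start with looks roughly like this (the ZooKeeper quorum, group id and topic are placeholders). As I understand it, the high-level consumer commits offsets to ZooKeeper, and data buffered in a receiver can be lost on failure unless the write-ahead log is enabled:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverBasedStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("receiver-stream"), Seconds(10))

    // High-level consumer: offsets for this group id are committed to ZooKeeper.
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181", "my-consumer-group", Map("sensor-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)

    stream.map(_._2).count().print() // values only; replace with real processing

    ssc.start()
    ssc.awaitTermination()
  }
}
```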
Created 03-23-2016 05:51 AM
Created 03-23-2016 06:14 AM
Yes, I had figured this out. I'm OK with a little data loss (it's sensor and positioning data; it isn't critical that all of it is seen, as long as almost all of it is).
As far as exactly-once semantics go... that is such a Pandora's box. I'm going to have a sequence of Spark jobs connected through Kafka. Correct me if I'm wrong, but if I'm writing to Kafka from Spark, I don't think there is a way to get exactly-once output anyway.
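The way I was planning to write back to Kafka is the usual foreachPartition-with-a-producer pattern, which as far as I can tell is at-least-once at best: if a task is retried after a partial send, the already-sent records go out again, so the downstream job has to tolerate duplicates. A sketch, with the broker address and topic made up:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

def writeToKafka(results: DStream[String]): Unit = {
  results.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One producer per partition per batch; a pooled/reused producer would be better in practice.
      val props = new Properties()
      props.put("bootstrap.servers", "broker1:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        // If this task fails part-way through and is retried, the already-sent records
        // are sent again: at-least-once, so the downstream job must tolerate or dedupe duplicates.
        records.foreach(r => producer.send(new ProducerRecord[String, String]("output-topic", r)))
      } finally {
        producer.close()
      }
    }
  }
}
```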