03-15-2016 09:10 AM
I am trying to use a kafka direct receiver to process data in a spark stream in a reliable way. I understand that when using checkpointing if anything happens to the driver this will automatically pick up from where it left off previously. So far so good.
However, not clear to me is how I am supposed to handle changes to the code in such an approach (e.g. to calculate additional outputs based on the incoming data). Any code changes to the logic of the streams will be ignored as long as there are checkpoints, so basically once I ship this streaming job I can never change it without throwing away the checkpoints. By throwing away the checkpoints I force the job to either reprocess the entire kafka topic (earliest), or start right at the end of the topic possibly skipping over unprocessed data (latest). Either way, not very reliable.
Am I missing something here? What is the right architecture for building a reliable kafka streaming job that allows for code updates? Am I supposed to keep track of the offsets myself and not use checkpoints? If so, is there any sort of example out there on how to do this?
03-17-2016 08:30 AM
Yeah checkpoints aren't great for that. I honestly don't rely on them.
There are examples of how to save offsets yourself in https://github.com/koeninger/kafka-exactly-once/
The newer Kafka consumer will allow committing offsets to kafka (which still isn't transactional, but should be better than Zookeeper in most ways). There's work towards making the new consumer work with spark at
03-17-2016 09:14 AM
Thanks for the response.
So, basically listing my options:
1. Implement direct streams and track offsets myself by including a bunch of code based on the example you linked
2. Use the old-style kafka receiver, and wait for it to get upgraded to the new consumer API in a future spark version.
Either way avoid using checkpoints.
I guess I'll go with the receiver-style kafka stream for now. It works and it's simplest to use.
03-23-2016 05:51 AM
03-23-2016 06:14 AM
Yes, I had figured this out. I'm ok with a little data loss (it's sensor and positioning data, not supercritical that all of it is seen, as long as almost all of it is seen).
As far as exactly once semantics... that is such a pandora's box. I'm going to have a sequence of spark jobs connected through kafka. Correct me if I'm wrong, but if I'm writing to kafka from spark, I don't think there is a way to have exactly once output anyway.