03-23-2016 06:14 AM
Yes, I had figured this out. I'm OK with a little data loss (it's sensor and positioning data, so it isn't critical that every record is seen, as long as almost all of it is). As for exactly-once semantics... that is such a Pandora's box. I'm going to have a sequence of Spark jobs connected through Kafka, and correct me if I'm wrong, but if I'm writing to Kafka from Spark, I don't think there is a way to get exactly-once output anyway.
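To illustrate the output side, here is a minimal sketch of writing a batch back to Kafka with foreachPartition; the broker address and the topic name "output-topic" are placeholders, not anything from the actual job. Because the producer simply re-sends records whenever a batch is retried, the delivery guarantee is at-least-once rather than exactly-once, which is the limitation described above.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Sketch only: broker list and "output-topic" are placeholders.
def writeToKafka(output: DStream[String]): Unit = {
  output.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One producer per partition per batch; a retried batch re-sends
      // the same records, so delivery is at-least-once.
      val props = new Properties()
      props.put("bootstrap.servers", "broker1:9092")
      props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      records.foreach { r =>
        producer.send(new ProducerRecord[String, String]("output-topic", r))
      }
      producer.close()
    }
  }
}
```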
03-17-2016 09:14 AM
Thanks for the response. So, basically, my options are:
1. Implement direct streams and track the offsets myself, with code based on the example you linked (see the sketch below).
2. Use the old-style Kafka receiver and wait for it to be upgraded to the new consumer API in a future Spark version.
Either way, avoid checkpoints. I guess I'll go with the receiver-style Kafka stream for now; it works and it's the simplest to use.
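For reference, here is a minimal sketch of what option 1 would look like with the Kafka 0.8 direct API, following the offset-tracking pattern from the Spark documentation. The loadOffsets/saveOffsets helpers are hypothetical placeholders for whatever external store is used (ZooKeeper, HBase, a database, etc.), as are the app, broker, and topic names.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object DirectStreamWithOffsets {
  // Placeholders: persist offsets in ZooKeeper, HBase, a database, ...
  def loadOffsets(): Map[TopicAndPartition, Long] = ???
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("sensor-stream"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Start the direct stream from offsets we stored ourselves, so no
    // Spark checkpoint is needed and the job's code can change freely.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, loadOffsets(),
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch ...
      saveOffsets(ranges) // commit offsets only after processing succeeds
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```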
03-15-2016 09:10 AM
1 Kudo
I am trying to use a Kafka direct stream to process data with Spark Streaming in a reliable way. I understand that with checkpointing, if anything happens to the driver, the job will automatically pick up from where it left off. So far so good.

What is not clear to me is how I am supposed to handle code changes with this approach (e.g. calculating additional outputs from the incoming data). Any changes to the streaming logic are ignored as long as checkpoints exist, so once I ship this streaming job I can never change it without throwing away the checkpoints. Throwing away the checkpoints forces the job either to reprocess the entire Kafka topic (earliest) or to start at the end of the topic and possibly skip unprocessed data (latest). Either way, not very reliable.

Am I missing something here? What is the right architecture for building a reliable Kafka streaming job that allows for code updates? Am I supposed to keep track of the offsets myself and not use checkpoints? If so, is there any example out there of how to do this?
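To make the problem concrete, here is a minimal sketch of the checkpoint-based setup described above, with placeholder names for the app, broker, topic, and checkpoint directory. On restart, StreamingContext.getOrCreate restores the whole DStream graph from the checkpoint, which is exactly why any new logic inside createContext() is ignored until the checkpoint is deleted.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: checkpoint path, broker list, and topic are placeholders.
def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("sensor-stream"), Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/sensor-stream")

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("sensor-data"))

  stream.map(_._2).foreachRDD { rdd =>
    // ... processing logic; changing this has no effect while the
    // checkpoint from the old code still exists ...
  }
  ssc
}

// On restart the DStream graph is restored from the checkpoint directory,
// so the job resumes from its stored offsets with the old logic.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/sensor-stream", createContext _)
ssc.start()
ssc.awaitTermination()
```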
Labels:
- Apache Kafka
- Apache Spark