Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

kafka direct + spark streaming + checkpoints + code changes

SOLVED Go to solution

kafka direct + spark streaming + checkpoints + code changes

New Contributor

I am trying to use a kafka direct receiver to process data in a spark stream in a reliable way. I understand that when using checkpointing if anything happens to the driver this will automatically pick up from where it left off previously. So far so good.

 

However, not clear to me is how I am supposed to handle changes to the code in such an approach (e.g. to calculate additional outputs based on the incoming data). Any code changes to the logic of the streams will be ignored as long as there are checkpoints, so basically once I ship this streaming job I can never change it without throwing away the checkpoints. By throwing away the checkpoints I force the job to either reprocess the entire kafka topic (earliest), or start right at the end of the topic possibly skipping over unprocessed data (latest). Either way, not very reliable.

 

Am I missing something here? What is the right architecture for building a reliable kafka streaming job that allows for code updates? Am I supposed to keep track of the offsets myself and not use checkpoints? If so, is there any sort of example out there on how to do this?

 

1 ACCEPTED SOLUTION

Accepted Solutions

Re: kafka direct + spark streaming + checkpoints + code changes

New Contributor

Yeah checkpoints aren't great for that.  I honestly don't rely on them.
 
There are examples of how to save offsets yourself in  https://github.com/koeninger/kafka-exactly-once/
 
specifically
 
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerP...
 
The newer Kafka consumer will allow committing offsets to kafka (which still isn't transactional, but should be better than Zookeeper in most ways).  There's work towards making the new consumer work with spark at
 
https://issues.apache.org/jira/browse/SPARK-12177
 
and
 
https://github.com/koeninger/spark-1/tree/kafka-0.9/external/kafka-beta/src/main/scala/org/apache/sp...

4 REPLIES 4

Re: kafka direct + spark streaming + checkpoints + code changes

New Contributor

Yeah checkpoints aren't great for that.  I honestly don't rely on them.
 
There are examples of how to save offsets yourself in  https://github.com/koeninger/kafka-exactly-once/
 
specifically
 
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerP...
 
The newer Kafka consumer will allow committing offsets to kafka (which still isn't transactional, but should be better than Zookeeper in most ways).  There's work towards making the new consumer work with spark at
 
https://issues.apache.org/jira/browse/SPARK-12177
 
and
 
https://github.com/koeninger/spark-1/tree/kafka-0.9/external/kafka-beta/src/main/scala/org/apache/sp...

Highlighted

Re: kafka direct + spark streaming + checkpoints + code changes

New Contributor

Thanks for the response.

 

So, basically listing my options:

1. Implement direct streams and track offsets myself by including a bunch of code based on the example you linked

2. Use the old-style kafka receiver, and wait for it to get upgraded to the new consumer API in a future spark version.

 

Either way avoid using checkpoints.

 

I guess I'll go with the receiver-style kafka stream for now. It works and it's simplest to use.

 

Re: kafka direct + spark streaming + checkpoints + code changes

New Contributor
To be clear the old receiver based consumer doesn't solve the problem for you either. You'll get data loss unless you're using hdfs as a write ahead log, and even then it won't allow for exactly once output semantics.

Re: kafka direct + spark streaming + checkpoints + code changes

New Contributor

Yes, I had figured this out. I'm ok with a little data loss (it's sensor and positioning data, not supercritical that all of it is seen, as long as almost all of it is seen).

 

As far as exactly once semantics... that is such a pandora's box. I'm going to have a sequence of spark jobs connected through kafka. Correct me if I'm wrong, but if I'm writing to kafka from spark, I don't think there is a way to have exactly once output anyway.