Support Questions

jsebrech · ‎03-15-2016

I am trying to use the Kafka direct stream to process data in Spark Streaming in a reliable way. I understand that with checkpointing enabled, if anything happens to the driver, the job automatically picks up from where it left off. So far so good.

However, it is not clear to me how I am supposed to handle code changes with this approach (e.g. to calculate additional outputs based on the incoming data). Any changes to the stream logic are ignored as long as checkpoints exist, so once I ship this streaming job I can never change it without throwing away the checkpoints. Throwing away the checkpoints forces the job either to reprocess the entire Kafka topic (earliest) or to start at the end of the topic, possibly skipping unprocessed data (latest). Either way, not very reliable.

Am I missing something here? What is the right architecture for building a reliable kafka streaming job that allows for code updates? Am I supposed to keep track of the offsets myself and not use checkpoints? If so, is there any sort of example out there on how to do this?

cody · ‎03-17-2016

Yeah, checkpoints aren't great for that. I honestly don't rely on them.

There are examples of how to save offsets yourself in https://github.com/koeninger/kafka-exactly-once/

specifically

https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerP...

The newer Kafka consumer will allow committing offsets to Kafka (which still isn't transactional, but should be better than ZooKeeper in most ways). There's work towards making the new consumer work with Spark at

https://issues.apache.org/jira/browse/SPARK-12177

and

https://github.com/koeninger/spark-1/tree/kafka-0.9/external/kafka-beta/src/main/scala/org/apache/sp...
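
A minimal sketch of the "save offsets yourself" idea, not the code from the linked repository: the results and the next offset to read are written in one database transaction, so a restart (including one after a code change) resumes from the stored offset instead of from earliest or latest. Python with SQLite stands in for the Spark job and its store here; the schema and the names `open_store`, `seed_partition`, `load_offset` and `process_range` are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the two tables (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "topic TEXT, part INTEGER, total INTEGER, PRIMARY KEY (topic, part))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "topic TEXT, part INTEGER, next_offset INTEGER, PRIMARY KEY (topic, part))"
    )
    conn.commit()
    return conn


def seed_partition(conn, topic, part, start_offset=0):
    """Register a partition once; rows that already exist are left untouched."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO results VALUES (?, ?, 0)", (topic, part))
        conn.execute(
            "INSERT OR IGNORE INTO offsets VALUES (?, ?, ?)",
            (topic, part, start_offset),
        )


def load_offset(conn, topic, part):
    """Return the next offset to read, or None for an unknown partition."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?",
        (topic, part),
    ).fetchone()
    return row[0] if row else None


def process_range(conn, topic, part, from_offset, until_offset, values):
    """Apply one offset range; the result and the offset commit or roll back together."""
    with conn:  # single transaction: committed on success, rolled back on error
        conn.execute(
            "UPDATE results SET total = total + ? WHERE topic = ? AND part = ?",
            (sum(values), topic, part),
        )
        cur = conn.execute(
            "UPDATE offsets SET next_offset = ? "
            "WHERE topic = ? AND part = ? AND next_offset = ?",
            (until_offset, topic, part, from_offset),
        )
        if cur.rowcount != 1:
            # Stored offset differs from from_offset: this is a replay or a gap,
            # so abort and let the transaction undo the result update as well.
            raise RuntimeError("offset mismatch for %s/%d" % (topic, part))


if __name__ == "__main__":
    store = open_store()
    seed_partition(store, "sensors", 0)
    process_range(store, "sensors", 0, 0, 3, [1, 2, 3])
    print("resume from offset", load_offset(store, "sensors", 0))
```

On startup the job would call `load_offset` for each partition and begin reading there. Because the stored offset only advances together with the results, replaying an already-applied range fails and changes nothing.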

jsebrech · ‎03-17-2016

Thanks for the response.

So, basically listing my options:

1. Implement direct streams and track offsets myself by including a bunch of code based on the example you linked

2. Use the old-style kafka receiver, and wait for it to get upgraded to the new consumer API in a future spark version.

Either way avoid using checkpoints.

I guess I'll go with the receiver-style kafka stream for now. It works and it's simplest to use.

cody · ‎03-23-2016

To be clear, the old receiver-based consumer doesn't solve the problem for you either. You'll get data loss unless you're using HDFS as a write-ahead log, and even then it won't allow for exactly-once output semantics.

jsebrech · ‎03-23-2016

Yes, I had figured this out. I'm ok with a little data loss (it's sensor and positioning data, not supercritical that all of it is seen, as long as almost all of it is seen).

As for exactly-once semantics... that is such a Pandora's box. I'm going to have a sequence of Spark jobs connected through Kafka. Correct me if I'm wrong, but if I'm writing to Kafka from Spark, I don't think there is a way to have exactly-once output anyway.
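
One way a chain of jobs can tolerate duplicate output is to make the final store idempotent, so that a replayed message is simply ignored. The sketch below is an illustration only, not something proposed in this thread: it assumes every record carries the topic, partition and offset it was originally read from, and uses that triple as a primary key. SQLite stands in for the real sink; the `readings` table and the names `open_sink` and `write_idempotent` are hypothetical.

```python
import sqlite3


def open_sink(path=":memory:"):
    """Create a sink table keyed by the record's source coordinates (assumed schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "src_topic TEXT, src_part INTEGER, src_offset INTEGER, payload TEXT, "
        "PRIMARY KEY (src_topic, src_part, src_offset))"
    )
    conn.commit()
    return conn


def count_rows(conn):
    """Number of distinct records stored so far."""
    return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


def write_idempotent(conn, records):
    """Insert (topic, partition, offset, payload) tuples; duplicates are skipped.

    Returns how many records were actually new.
    """
    before = count_rows(conn)
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT OR IGNORE INTO readings VALUES (?, ?, ?, ?)", list(records)
        )
    return count_rows(conn) - before


if __name__ == "__main__":
    sink = open_sink()
    batch = [("sensors", 0, 0, "a"), ("sensors", 0, 1, "b")]
    print("new rows:", write_idempotent(sink, batch))
    print("new rows on replay:", write_idempotent(sink, batch))
```

This gives at-least-once delivery with deduplication at the end of the chain rather than exactly-once output to Kafka itself, and it only works if the key survives every hop between jobs.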

Cloudera Community

kafka direct + spark streaming + checkpoints + code changes