
Is Spark Streaming 2.0 suitable for low-latency, event-based real-time analytics?

avatar
Expert Contributor

I know about the new features coming in 2.0, such as ad hoc querying, but we are choosing a technology for a real-time use case that is low latency and event based. Since Spark Streaming is micro-batch, I am not sure it is suitable for low-latency, event-based use cases. I am looking at Storm, Flink, and Apex as well, but since we already use Spark for batch processing we would prefer Spark Streaming, provided it can handle low latency and event-based processing. The companies I have seen using Spark Streaming did not have latency as their top priority. Please suggest. Thank you.


8 REPLIES

avatar

Hi @SparkRocks. It's certainly possible. Honestly, people who require very low latency (down to a few milliseconds) are mostly still using Storm today, but Spark is getting closer and closer to that point. It may still carry slightly higher latency; it depends on how low you need to go.

Storm might be the perfect technology to implement it, but if you're happy with Spark, have already invested experience and knowledge in it, and are generally comfortable with the way Spark does things, I'd say go ahead and build it in Spark.

Hope that helps.

avatar
Expert Contributor

@Dave Russell Thanks for your reply, Dave. In terms of latency, I believe it is 1 second, but I will know the exact requirement this week. What about event-driven processing: can Spark Streaming handle event-driven real-time analytics as well? Thanks.

avatar

Latency-wise, that should be very easy to achieve. As for the event-driven nature of what you're trying to build, I don't see why not.

Hope that helps.

avatar
Master Guru

I don't think Spark Streaming 2.0 will change much for your requirements. AFAIK it will provide an easy way to run SQL on top of the stream (I might be mistaken, so Spark experts feel free to correct me), but it will not change the underlying architecture or the latency characteristics.
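For reference, the "SQL on top of the stream" feature became Structured Streaming, which shipped as an alpha in Spark 2.0. Below is a minimal sketch of what that looks like; the socket source, host and port, line format, and column names are illustrative assumptions, not anything from this thread.

```scala
// Minimal sketch only: assumes Spark 2.0+ and a demo text source on
// localhost:9999 streaming lines like "user,event" (all illustrative;
// the socket source is not meant for production use).
import org.apache.spark.sql.SparkSession

object SqlOnStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-on-stream-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Read an unbounded stream as a DataFrame with a single "value" column.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Parse the lines, register the stream as a view, and query it with SQL.
    val events = lines.as[String]
      .map(_.split(","))
      .map(a => (a(0), a(1)))
      .toDF("user", "event")
    events.createOrReplaceTempView("events")
    val counts = spark.sql("SELECT event, COUNT(*) AS cnt FROM events GROUP BY event")

    // The aggregation is recomputed incrementally as new micro-batches arrive.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```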

In the end I think it depends on your workload.

What kind of latency do you expect? You will have a hard to impossible time getting sub-second latency out of Spark Streaming, for example. In that case I would go with Storm.
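To make the latency point concrete: with the DStream API the batch interval is fixed when the StreamingContext is created, so end-to-end latency can never drop below that interval plus scheduling and processing time. A minimal sketch, with the 1-second interval and the socket source chosen purely for illustration:

```scala
// Sketch only: the 1-second batch interval and the socket source are
// illustrative. Records are buffered and processed once per interval, so
// end-to-end latency is at least the interval plus processing time.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-interval-sketch").setMaster("local[2]")
    // Every DStream in this context is executed as one job per 1-second batch.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print() // one count per completed micro-batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```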

Other reasons for Storm:

- Out-of-order processing is much easier (i.e. some heavy tuples can take a long time to process while fast tuples keep flowing at the same time without blocking).

- I think it is easier to build control flows.

- Essentially, any time you have a complex flow of multiple input streams that do not need complex joins but behave more like control flows, I would go with Storm (see the sketch after this list).
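As a rough illustration of the control-flow point above: a Storm bolt can subscribe to several upstream streams and branch on the source component, so control tuples get handled promptly even while heavy data tuples are in flight. This is only a sketch against the Storm 1.x API (called from Scala); the component ids are arbitrary and TestWordSpout merely stands in for real spouts.

```scala
// Rough sketch against the Storm 1.x API. Idea: one bolt subscribes to a
// data stream and a separate control stream and branches on the source
// component, so control tuples are handled even while heavy data tuples
// are being processed. "events"/"control" are arbitrary ids; TestWordSpout
// is a stand-in that simply emits random words.
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

class ControlAwareBolt extends BaseBasicBolt {
  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    if (tuple.getSourceComponent == "control") {
      // e.g. swap a model or adjust a threshold; kept trivial here
      collector.emit(new Values("control:" + tuple.getString(0)))
    } else {
      // potentially heavy per-tuple work would happen here
      collector.emit(new Values("data:" + tuple.getString(0)))
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("result"))
}

object ControlFlowTopologySketch {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder()
    builder.setSpout("events", new TestWordSpout())
    builder.setSpout("control", new TestWordSpout())
    builder.setBolt("decide", new ControlAwareBolt())
      .shuffleGrouping("events")
      .shuffleGrouping("control")

    val cluster = new LocalCluster()
    cluster.submitTopology("control-flow-sketch", new Config(), builder.createTopology())
    Thread.sleep(10000)
    cluster.shutdown()
  }
}
```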

Spark Streaming:

- You have the full power of Spark at your disposal; data transformation steps like groupings and joins are much more natural (see the sketch after this list).

- All the Spark tooling and features, such as MLlib, are available.
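To illustrate the groupings-and-joins point: with the DStream API a keyed join and a per-key aggregation are each a single transformation. A small sketch; the two socket sources and the "key,value" line format are assumptions made for the example.

```scala
// Sketch only: both socket sources and the "key,value" line format are
// assumptions. The point is that keyed joins and aggregations on streams
// are ordinary one-line Spark transformations.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object JoinGroupSketch {
  def main(args: Array[String]): Unit = {
    // local[4]: each socket receiver occupies a core, so leave room for processing.
    val conf = new SparkConf().setAppName("join-group-sketch").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(1))

    def keyed(port: Int) =
      ssc.socketTextStream("localhost", port)
        .map(_.split(","))
        .map(a => (a(0), a(1)))

    val clicks = keyed(9999)
    val offers = keyed(9998)

    // Per-batch join on the key, then a per-key count: both plain Spark ops.
    clicks.join(offers).print()
    clicks.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```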

avatar
Expert Contributor

@Benjamin Leonhardi Thanks for the answer. Do you know if Storm 1.0 can integrate with machine learning algorithms? Also, I believe Storm is event driven as well? I am looking for a technology, either Spark or Storm, that solves the event-driven and low-latency problem: for example, while a customer is on the website, our streaming application should, based on some machine learning models, present offers such as the next best product or next best offer with millisecond latency, before the customer leaves the site. That is the high-level requirement. Do you think we should use both Spark Streaming and Storm, since we cannot get millisecond latency and event processing with Spark Streaming? Thank you.

avatar
Master Guru

@SparkRocks Both can be made event driven, but Storm is much easier for this since it is not based on relatively synchronous mini-batches the way Spark is. It is more natural to have a control-event spout in Storm, IMO. I think your use case sounds more like Storm, but Spark Streaming might be made to work too, I suppose.

For your use case I would look at Storm, perhaps coupled with some standard web services.

Regarding integration of ML with Storm:

- I have seen an R integration for Storm:

https://github.com/allenday/R-Storm

- I have seen a PMML library used to run models in Storm, for example models created in Spark (a subset of models can be exported as PMML):

http://henning.kropponline.de/2015/09/06/jpmml-example-random-forest/

- I have seen a demo where a Spark model context was instantiated in Storm to score a model. @Vadim in this community could help.

IMO, JPMML would be the cleanest way to do it, though there are some limitations, since Spark only exports a subset of models to PMML and other tools also have limited support for this standard.
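On the PMML route: the RDD-based MLlib API can export a handful of model types directly, which is the "subset of models" limitation mentioned above. A minimal sketch using KMeans; the toy data and the output path are placeholders.

```scala
// Sketch only: the toy data and the output path are placeholders. Only a
// subset of the RDD-based MLlib models (e.g. KMeans and the linear models)
// support PMML export; the resulting XML can then be loaded by a PMML
// evaluator such as JPMML inside a Storm bolt for low-latency scoring.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object PmmlExportSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pmml-export-sketch").setMaster("local[2]"))

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // Train offline in Spark...
    val model = KMeans.train(points, 2, 20)

    // ...and export the fitted model as PMML for scoring outside Spark.
    model.toPMML("/tmp/kmeans-model.pmml")
    println(model.toPMML()) // the same document as a String

    sc.stop()
  }
}
```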

avatar
Expert Contributor

@Benjamin Leonhardi Thank you. What about using both Spark Streaming and Storm for this use case? I am not sure whether we can integrate Spark Streaming with Storm; it might be a weird architecture. I think Storm fits this requirement as well, but we are planning to build a POC on both Storm and Spark Streaming and then productionize one. Thank you.

avatar
Master Guru

The only reason I could see for using both would be to run automated modeling tasks in Spark and push the models to Storm for scoring and event-driven prediction.

It sounds like your graph would have different inputs and outputs running at different speeds, which sounds more like Storm. However, if you CAN do it all with Spark Streaming, that would be nice of course. I am just dubious about the latency and the event model.