stream processing runtimes

avijeetd — Thu, 02 Feb 2017 18:42:04 GMT

Hi All,

Most of the batch processing frameworks (MapReduce, Spark) support a local mode and a distributed mode (standalone, YARN, Mesos) of deployment and execution.

What about stream processing frameworks such as Storm and Spark Streaming? Do they manage the distributed mode on their own? Is it even realistic to expect them to work on YARN?

How do I monitor a distributed Spark Streaming job? And do we need to specify the master as yarn to make it distributed?

Thanks,

Avijeet

Re: stream processing runtimes

tkiss — Thu, 02 Feb 2017 19:39:34 GMT

Hello,

Both Storm and Spark support local mode.

In Storm you need to create a LocalCluster instance, and then you can submit your topology to it. You can find a description and an example at these links:

http://storm.apache.org/releases/1.0.2/Local-mode.html

https://github.com/apache/storm/blob/1.0.x-branch/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java#L98

Spark's approach to local mode is somewhat different. Where the job runs is controlled through the master setting (the spark.master property, or the --master option of spark-submit), which can be set to local, local[N] (where N is the number of worker threads) or local[*] (one worker thread per core). If a local master is specified, Spark runs on your own machine instead of on a cluster.

Both Storm and Spark have monitoring capabilities through a web interface. You can find details about them here:
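To make the master values above concrete, here is a small Python sketch that maps a master string to a description of where the job would run. It is only an illustration: the function name describe_master is made up for this example and is not part of Spark, and it covers just the common master URL forms (local variants, spark://, mesos://, yarn).

```python
import re


def describe_master(master: str) -> str:
    """Describe where Spark would run for a given master URL.

    Hypothetical helper for illustration only; Spark itself does the
    real parsing when it receives spark.master / --master.
    """
    if master == "local":
        return "local mode, 1 worker thread"
    if master == "local[*]":
        return "local mode, one worker thread per logical core"
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match:
        return f"local mode, {int(match.group(1))} worker threads"
    if master.startswith("spark://"):
        return "Spark standalone cluster"
    if master.startswith("mesos://"):
        return "Mesos cluster"
    if master == "yarn":
        return "YARN cluster"
    raise ValueError(f"unrecognized master URL: {master!r}")


for value in ["local", "local[4]", "local[*]", "spark://host:7077", "yarn"]:
    print(value, "->", describe_master(value))
```

Only the local variants keep everything on your machine; the other forms hand the job to a cluster manager.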

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_storm-component-guide/content/using-storm-ui.html

http://spark.apache.org/docs/latest/monitoring.html

YARN is not a requirement but an option for distributed mode; both Spark and Storm are able to run in distributed mode on their own.

Re: stream processing runtimes

avijeetd — Thu, 02 Feb 2017 19:58:13 GMT

Thanks @Tibor Kiss - I am looking for more information about distributed mode. Is there a name for the cluster managers in Storm or Spark Streaming?

Re: stream processing runtimes

tkiss — Thu, 02 Feb 2017 20:15:37 GMT

In Storm's nomenclature 'nimbus' is the cluster manager:

http://storm.apache.org/releases/1.0.1/Setting-up-a-Storm-cluster.html

Spark calls its standalone cluster manager the 'master':

http://spark.apache.org/docs/latest/spark-standalone.html

Re: stream processing runtimes

avijeetd — Fri, 03 Feb 2017 12:36:23 GMT

That's great @Tibor Kiss - I am trying to run a Spark Streaming job. How do I tell it to run in standalone cluster mode?
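As a rough illustration of the standalone case: the job is pointed at the standalone master through a master URL of the form spark://HOST:PORT (7077 is the standalone master's default port), passed with the --master option of spark-submit. The Python sketch below only assembles such a command line and does not run anything. The host name, jar name and class name are hypothetical placeholders, and spark_submit_command is a made-up helper, not a Spark API.

```python
import shlex


def spark_submit_command(master_host, app_jar, main_class,
                         port=7077, deploy_mode="client"):
    """Build the argument list for submitting to a standalone master.

    Hypothetical helper: it only constructs the command, it does not
    execute it. --master, --deploy-mode and --class are spark-submit options.
    """
    master_url = f"spark://{master_host}:{port}"
    return [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", deploy_mode,  # "client" or "cluster"
        "--class", main_class,
        app_jar,
    ]


# Placeholder values, assumed for the example only.
cmd = spark_submit_command(
    "master-host.example.com",
    "streaming-app.jar",
    "com.example.StreamingApp",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Once the application is submitted this way, the standalone master's web UI and the application's own Spark UI (see the monitoring link earlier in the thread) are the places to watch it.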