Managing and deploying Spark applications

I'm interested to hear how others deploy Spark applications to their clusters.


Currently I use Oozie to manage MapReduce / Hive workflows.  It's not perfect (far from it), but at least the Hue GUI offers a nice view of the workflow and clearly indicates when a stage has failed.


On the Spark side, I currently have an application running nightly.  Oozie runs a shell script that feeds the Spark script to the shell with: spark-shell < myscript.scala


That's about as nasty as it gets.  I can think of a couple of alternatives:


  • Build my script into a jar.  Use an Oozie shell task to spark-submit it to the cluster.  That's not a whole lot better than the first case, but I'd probably get a more sensible return code which Oozie could test for (spark-shell always exits successfully, as you'd expect).
  • Write a Spark app with a long running driver that sleeps / loops.  That would let me monitor the application through the Spark Master GUI.  I'm not sure how many resources a long-running driver consumes - does it reserve memory for workers? 
  • A crontab and spark-submit. Easier to configure than Oozie, but with almost no 'free' monitoring available.
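
As a rough sketch of the first and third options, a small wrapper can pass the job's exit status back to whatever scheduled it (an Oozie shell action or cron). The function name run_and_report, the class name, and the jar path are hypothetical placeholders. It also assumes that spark-submit exits non-zero when the application fails, which may depend on the deploy mode and Spark version.

```shell
#!/bin/sh
# Run the given command, log a message on failure, and return the
# command's own exit status so the scheduler can test for it.
run_and_report() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "job failed with exit code $status" >&2
    fi
    return "$status"
}

# Hypothetical usage; the class name and jar path are placeholders:
# run_and_report spark-submit --class com.example.NightlyJob --master yarn /path/to/nightly-job.jar
```

Under cron the same wrapper would work, though the failure message would only surface through cron's usual mail or log redirection.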

Is there an alternative?  What do others do?




Re: Managing and deploying Spark applications

If you use a more recent Oozie release, you can directly use the Spark action instead: