Support Questions

yjiang · ‎03-22-2016

Is there a way to train the model offline in Spark MLlib, and then use it for online ML in Storm?

hkropp · ‎03-22-2016

You can use PMML (https://de.wikipedia.org/wiki/Predictive_Model_Markup_Language).

Spark supports exporting some (but not all) models to PMML:

http://spark.apache.org/docs/latest/mllib-pmml-model-export.html

(UPDATE: As @Simon Elliston Ball rightly points out in his answer, if a model cannot be exported to PMML, the Spark libraries can be reused directly, since most of the model classes have no dependency on the SparkContext.)

One way could be to use JPMML with Java in Storm:

http://henning.kropponline.de/2015/09/06/jpmml-example-random-forest/

https://github.com/jpmml/jpmml-storm

The other could be to use R in Storm. I have seen it done, but don't have a reference at hand.

hkropp · ‎03-22-2016

In an advanced architecture, you would use ZooKeeper to announce a new model to the topology without taking it offline.
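A minimal sketch of the swap itself, assuming the bolt holds its model behind a small holder object. The names `Scorer` and `ModelHolder` are hypothetical and not part of Storm or ZooKeeper; the ZooKeeper watcher that would call `swap` after loading the announced model is deliberately left out.

```java
// Hypothetical scoring interface; in a real topology this would wrap
// a PMML evaluator or a Spark model.
interface Scorer {
    double score(double[] features);
}

// Hypothetical holder that lets a running bolt pick up a new model
// without a restart.
final class ModelHolder {
    // volatile: a model published by the watcher thread becomes visible
    // to the thread that executes tuples.
    private volatile Scorer current;

    ModelHolder(Scorer initial) {
        if (initial == null) {
            throw new IllegalArgumentException("initial model must not be null");
        }
        this.current = initial;
    }

    // Assumed to be called from a ZooKeeper watcher callback once the
    // newly announced model has been fully loaded.
    void swap(Scorer next) {
        if (next == null) {
            throw new IllegalArgumentException("new model must not be null");
        }
        current = next;
    }

    // Called for every tuple; always uses the most recently published model.
    double score(double[] features) {
        return current.score(features);
    }
}
```

Loading the new model completely before calling `swap` keeps the scoring path free of locks, since each tuple sees either the old model or the new one, never a half-initialized one.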

sball · ‎03-22-2016

PMML is certainly a good option, but be aware that Spark does not support the transformation elements of PMML, so you will need to recreate any feature scaling and transformation before the scoring step.
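As a rough illustration of what recreating the scaling can look like, here is a sketch that assumes the model was trained on standardized features and that the per-feature mean and standard deviation were saved alongside the exported model. The class `FeatureScaler` is hypothetical and not part of Spark, Storm, or JPMML.

```java
// Hypothetical helper that re-applies training-time standardization
// inside the topology, before the features reach the scoring step.
final class FeatureScaler {
    private final double[] mean;
    private final double[] std;

    // mean and std are assumed to be the per-feature statistics
    // computed on the training data.
    FeatureScaler(double[] mean, double[] std) {
        if (mean.length != std.length) {
            throw new IllegalArgumentException("mean and std must have the same length");
        }
        this.mean = mean.clone();
        this.std = std.clone();
    }

    // Returns (raw - mean) / std per feature, as a new array.
    double[] transform(double[] raw) {
        if (raw.length != mean.length) {
            throw new IllegalArgumentException("unexpected number of features");
        }
        double[] scaled = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            // A constant feature (std == 0) is mapped to 0.0 to avoid division by zero.
            scaled[i] = std[i] == 0.0 ? 0.0 : (raw[i] - mean[i]) / std[i];
        }
        return scaled;
    }
}
```

Whatever transformation was used in training, the point is that the same statistics have to be shipped with the model and applied to every tuple; otherwise the scores will not match what the model produced offline.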

The other thing to note is that many of the Spark Model classes do not depend on the SparkContext, so you can link Spark into your Storm topology and just use the Spark Model itself.

This can lead to some unnecessary code in your jar, but has the advantage that you don't need to go through the PMML format.

hkropp · ‎03-22-2016

+1 for the point about reusing the Spark code itself.

Cloudera Community

How to Use Spark MLlib Model in Storm?