
How to Use a Spark MLlib Model in Storm?

Expert Contributor

Is there a way to train a model offline in Spark MLlib and then use it for online scoring in Storm?

1 ACCEPTED SOLUTION

Super Collaborator

You can use PMML (https://de.wikipedia.org/wiki/Predictive_Model_Markup_Language).

Spark supports exporting some (but not all) models to PMML:

http://spark.apache.org/docs/latest/mllib-pmml-model-export.html
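As a rough illustration of the offline side (Java, assuming a KMeans model and a whitespace-separated input file; the paths, k and iteration count are placeholders), the training-and-export job could look like this:

    // Sketch: train a KMeans model offline in Spark MLlib and export it as PMML.
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    public class TrainAndExport {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("pmml-export");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Parse whitespace-separated feature vectors from a text file.
        JavaRDD<Vector> data = sc.textFile("hdfs:///data/kmeans_data.txt")
            .map(line -> {
              String[] parts = line.trim().split(" ");
              double[] values = new double[parts.length];
              for (int i = 0; i < parts.length; i++) {
                values[i] = Double.parseDouble(parts[i]);
              }
              return Vectors.dense(values);
            })
            .cache();

        // Train offline, then write the model out as PMML on the local file system.
        KMeansModel model = KMeans.train(data.rdd(), 2, 20);
        model.toPMML("/tmp/kmeans.pmml");

        sc.stop();
      }
    }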

(UPDATE: As @Simon Elliston Ball rightly points out in his answer, if a model is not supported for PMML export, the Spark libraries can be reused directly, since most of the model classes have no dependency on the SparkContext.)

One option is to use JPMML with Java in Storm:

http://henning.kropponline.de/2015/09/06/jpmml-example-random-forest/

https://github.com/jpmml/jpmml-storm
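On the Storm side, a scoring bolt could load the exported file with JPMML-Evaluator roughly as in the sketch below. This is only an outline: package names (org.apache.storm vs. backtype.storm in older releases) and the evaluator factory/field API differ between JPMML versions, and the file path and output field name are made up.

    // Sketch: a Storm bolt that scores incoming tuples against a PMML model.
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    import org.dmg.pmml.FieldName;
    import org.dmg.pmml.PMML;
    import org.jpmml.evaluator.Evaluator;
    import org.jpmml.evaluator.EvaluatorUtil;
    import org.jpmml.evaluator.FieldValue;
    import org.jpmml.evaluator.InputField;
    import org.jpmml.evaluator.ModelEvaluatorFactory;

    public class PmmlScoringBolt extends BaseBasicBolt {

      private transient Evaluator evaluator;

      @Override
      public void prepare(Map<String, Object> conf, TopologyContext context) {
        // Load the model once per worker; the evaluator is not serialized with the bolt.
        try (InputStream is = new FileInputStream("/tmp/model.pmml")) {  // placeholder path
          PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
          evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml);
        } catch (Exception e) {
          throw new RuntimeException("Could not load PMML model", e);
        }
      }

      @Override
      public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Map each input field of the PMML model to the tuple field of the same name.
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : evaluator.getInputFields()) {
          Object raw = tuple.getValueByField(inputField.getName().getValue());
          arguments.put(inputField.getName(), inputField.prepare(raw));
        }
        Map<FieldName, ?> results = evaluator.evaluate(arguments);

        // Pull out the (single) target value and unwrap JPMML's internal value wrappers.
        FieldName targetName = evaluator.getTargetFields().get(0).getName();
        Object prediction = EvaluatorUtil.decode(results.get(targetName));
        collector.emit(new Values(prediction));
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("prediction"));
      }
    }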

Another option is to use R in Storm. I have seen it done, but don't have a reference at hand.


REPLIES


Super Collaborator

In an advanced architecture you would leverage ZooKeeper to announce a new model to the topology without taking it offline.
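A hedged sketch of that idea, using Apache Curator (one common ZooKeeper client) to watch a znode whose payload carries the location of the newest model; the znode path and the reload hook are assumptions, not part of any standard API:

    // Sketch: watch a znode and hot-swap the model when a new one is announced.
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.NodeCache;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ModelWatcher {

      private volatile String currentModelPath;

      public void start(String zkConnect, String znode) throws Exception {
        CuratorFramework client =
            CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();

        final NodeCache cache = new NodeCache(client, znode);
        cache.getListenable().addListener(() -> {
          if (cache.getCurrentData() != null) {
            // The znode payload is assumed to be the path of the newly trained model.
            currentModelPath = new String(cache.getCurrentData().getData());
            reloadModel(currentModelPath);  // hypothetical reload hook in the scoring bolt
          }
        });
        cache.start(true);
      }

      private void reloadModel(String path) {
        // Re-run the PMML (or Spark model) loading code from the scoring bolt here.
      }
    }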

Guru

PMML is certainly a good option, but be aware that Spark does not support the transformation elements of PMML, so you will need to recreate any feature scaling and other transformations before the scoring step.
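Purely as an illustration of re-applying that scaling by hand in the topology (the mean/std values are placeholders that would come from the StandardScalerModel or equivalent fitted during offline training):

    // Sketch: redo the training-time standardization before handing features to the evaluator.
    public final class FeatureScaling {

      // Placeholder statistics; in practice export them from the scaler fitted in Spark.
      private static final double[] MEAN = {0.12, 3.4};
      private static final double[] STD  = {1.7, 0.9};

      // Apply (x - mean) / std, because the exported PMML only describes the model,
      // not the preprocessing pipeline.
      public static double[] standardize(double[] raw) {
        double[] scaled = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
          scaled[i] = (raw[i] - MEAN[i]) / STD[i];
        }
        return scaled;
      }
    }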

The other thing to note is that many of the Spark model classes do not depend on the SparkContext, so you can link Spark into your Storm topology and just use the Spark model itself.

This can lead to some unnecessary code in your jar, but has the advantage that you don't need to go through the PMML format.
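A minimal sketch of that approach, assuming the cluster centers (or weights) are exported during offline training so the MLlib model object can be rebuilt locally; no SparkContext is needed at scoring time:

    // Sketch: rebuild and use an MLlib KMeansModel directly inside the topology.
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    public class LocalKMeansScorer {

      private final KMeansModel model;

      public LocalKMeansScorer(Vector[] clusterCenters) {
        // KMeansModel can be constructed from plain cluster centers,
        // so prediction works without any SparkContext.
        this.model = new KMeansModel(clusterCenters);
      }

      public int score(double[] features) {
        return model.predict(Vectors.dense(features));
      }

      public static void main(String[] args) {
        Vector[] centers = {               // placeholder centers from offline training
            Vectors.dense(0.0, 0.0),
            Vectors.dense(5.0, 5.0)
        };
        LocalKMeansScorer scorer = new LocalKMeansScorer(centers);
        System.out.println(scorer.score(new double[]{4.8, 5.2}));  // closest to the second center
      }
    }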

Super Collaborator

+1 for the suggestion to reuse the Spark code itself.