Created 03-22-2016 03:58 PM
Is there a way to train the model offline in Spark MLLib, and then use it for online ML in Storm?
Created 03-22-2016 04:14 PM
You can use PMML (https://de.wikipedia.org/wiki/Predictive_Model_Markup_Language).
Spark does support (not all) model to be exported to PMML:
http://spark.apache.org/docs/latest/mllib-pmml-model-export.html
(UPDATE: As @Simon Elliston Ball rightfully points out in his answer, in case the PMML model is not supported the Spark libs can be reused as most of them have no dependency to the SparkContext)
One way could be to use JPMML with Java in Storm:
http://henning.kropponline.de/2015/09/06/jpmml-example-random-forest/
https://github.com/jpmml/jpmml-storm
The other could be to use R in Storm. I have seen it done, but don't have a reference at hand.
Created 03-22-2016 04:14 PM
You can use PMML (https://de.wikipedia.org/wiki/Predictive_Model_Markup_Language).
Spark does support (not all) model to be exported to PMML:
http://spark.apache.org/docs/latest/mllib-pmml-model-export.html
(UPDATE: As @Simon Elliston Ball rightfully points out in his answer, in case the PMML model is not supported the Spark libs can be reused as most of them have no dependency to the SparkContext)
One way could be to use JPMML with Java in Storm:
http://henning.kropponline.de/2015/09/06/jpmml-example-random-forest/
https://github.com/jpmml/jpmml-storm
The other could be to use R in Storm. I have seen it done, but don't have a reference at hand.
Created 03-22-2016 04:17 PM
In in an advanced architecture you would leverage Zookeeper to announce a new model to the topology without taking it offline.
Created 03-22-2016 05:08 PM
PMML is certainly a good option, but be aware that Spark does not support the transformation elements of PMML, so you will need to recreate any feature scaling and transformation before the scoring step.
The other thing to note is that many of the Spark Model classes do not depend on the spark context, so you can link spark to you storm topology and just use the Spark Model itself.
This can lead to some unnecessary code in your jar, but has the advantage that you don't need to go through the PMML format.
Created 03-22-2016 07:22 PM
+1 for the aspect to reuse Spark code itself