HDF ships NiFi, Kafka, ZooKeeper, and Storm as default components. Why not Spark? I understand that Spark does not do true streaming and is better suited to micro-batching, but are there any additional reasons?
Thanks for replying. You are right, there is a connector to receive data from NiFi into Spark. I tried it by transferring data to Spark through an output port, and it works pretty well 🙂
My question is more about why Hortonworks decided not to include it.
I am also curious about the Machine Learning integration. Please share if there are any good ways to accept requests from NiFi, make predictions using a trained classifier, and finally return the scores as a response. Here again, NiFi can accept requests and return a response, but I am not able to hand the data to a classifier and get the scores back.
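For what it's worth, one common pattern (not specific to HDF, just a sketch) is to have NiFi's InvokeHTTP processor POST the request payload to a small HTTP scoring service that wraps the trained classifier and returns the scores in the response body, which NiFi then routes onward. A minimal sketch in Python; the `score()` function here is a hypothetical stand-in with made-up weights, and you would swap in your real trained model:

```python
# Minimal sketch of an HTTP scoring service that NiFi's InvokeHTTP
# processor could call. The classifier below is a hypothetical
# stand-in (fixed weights); replace score() with your trained model.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(features):
    # Stand-in "classifier": clamped weighted sum of the inputs.
    # The weights are assumptions for illustration only.
    weights = [0.3, 0.5, 0.2]
    s = sum(w * f for w, f in zip(weights, features))
    return max(0.0, min(1.0, s))


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body NiFi sends, e.g. {"features": [1.0, 0.2, 0.5]}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": score(payload["features"])}).encode("utf-8")
        # Return the score as JSON; NiFi receives it as the response FlowFile.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # suppress per-request console logging


def serve(port=8080):
    # Blocking call; run this where NiFi can reach it.
    HTTPServer(("127.0.0.1", port), ScoreHandler).serve_forever()
```

In the NiFi flow you would point InvokeHTTP's URL at this endpoint with the POST method, then pull the score out of the response with a processor such as EvaluateJsonPath. This is just one approach under those assumptions; a Kafka round-trip or an embedded scoring library in a custom processor would also work.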
As said, it is a question of priorities. Connectors normally graduate from a community project into the actual product. Once they are in, they have to be tested and supported, and they often need changes such as Kerberos support as well. So including something is not zero effort; it is just lower on the priority list than the Kafka connection, which is the standard approach for ingesting real-time data into a big data environment. If you need the integration, you could post an idea on the community board. Our product management teams read them.
Regarding the Machine Learning inquiry, please open a new question. That would make it easier for people who have tried this before to see it. Sorry I couldn't be of more help 🙂
I don't know why it was picked, but Apache Storm is production-ready, mature, and well instrumented for metrics, debugging, and running real production code. It is also true streaming, not micro-batching. Hortonworks also has committers on that project. Spark and Flink are newer and still maturing.