Created 06-30-2016 10:28 AM
HDF ships NiFi, Kafka, ZooKeeper, and Storm as default components. Why not Spark? I understand that Spark is not meant for true streaming and is better suited for micro-batching, but are there any additional reasons? Thanks.
Created 06-30-2016 10:32 AM
There is a Spark Streaming connector available, but if it's not in the installation then it's not supported (yet).
https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
In the end it's a question of priorities. Most of the time you would go NiFi -> Kafka -> Storm/Spark anyway, so that Kafka provides a proper, scalable big-data buffer.
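The value of Kafka as a buffer in that NiFi -> Kafka -> Storm/Spark pipeline can be sketched with a plain in-memory queue (purely illustrative; the bounded queue here stands in for a Kafka topic, and the names are made up for the example):

```python
import queue
import threading

# A bounded in-memory queue stands in for a Kafka topic: it decouples
# a bursty producer (NiFi) from a consumer (Storm/Spark) that drains
# events at its own steady pace.
topic = queue.Queue(maxsize=1000)

def produce(events):
    # NiFi side: push a burst of events; blocks only if the buffer fills up.
    for event in events:
        topic.put(event)

def consume(n, results):
    # Storm/Spark side: pull events one at a time, independently of the producer.
    for _ in range(n):
        results.append(topic.get())
        topic.task_done()

events = [f"event-{i}" for i in range(100)]
results = []
consumer = threading.Thread(target=consume, args=(len(events), results))
consumer.start()
produce(events)  # the producer can burst without the consumer keeping lock-step
consumer.join()
print(len(results))  # → 100
```

Kafka adds persistence, partitioning, and replay on top of this, but the decoupling idea is the same.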
Created 06-30-2016 10:54 AM
Thank you very much for the reply. If it's a matter of priorities, then I understand. 🙂
Sure, I will post the ML question separately.
Created 06-30-2016 10:39 AM
Hi @Benjamin Leonhardi,
Thanks for replying. You are right, there is a connector to receive data from NiFi in Spark. I tried it by transferring the data to Spark through an output port, and it works pretty well 🙂
My question is more about why Hortonworks decided not to include it.
I am also curious about the machine learning integration. Please share if there are any good ways to accept requests from NiFi, make predictions using a trained classifier, and return the scores as a response. NiFi can accept requests and send back a response, but I am not able to pass the data to a classifier and get the scores back.
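One common pattern (a general approach, not something specific to HDF) is to expose the trained classifier behind a small HTTP endpoint and have NiFi call it, e.g. with the InvokeHTTP processor, routing on the returned score. A minimal stdlib sketch, where the "trained model" is just a hypothetical fixed linear scorer and the port is illustrative:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained model": fixed weights standing in for a real
# classifier loaded from disk.
WEIGHTS = {"x1": 0.4, "x2": 0.6}

def score(features):
    # Dot product of the feature map with the trained weights.
    return sum(WEIGHTS.get(k, 0.0) * v for k, v in features.items())

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # NiFi (e.g. via InvokeHTTP) POSTs a JSON feature map ...
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": score(features)}).encode()
        # ... and the score goes back in the response for NiFi to route on.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

# To serve: HTTPServer(("0.0.0.0", 8000), ScoreHandler).serve_forever()
```

In a real setup the handler would load a serialized model once at startup; the request/response shape is the part NiFi cares about.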
Thanks again.
Created 06-30-2016 10:45 AM
As said, it's a question of priorities. Connectors normally graduate from a community project into the actual product. Once they are included they have to be tested and supported, and they often need changes such as Kerberos support as well, so including something is not zero effort. It's just lower on the priority list than the Kafka connection, which is the standard approach for ingesting real-time data into a big-data environment. If you need the integration, you could post an idea on the community board; our product management teams read them.
Regarding the machine learning inquiry, please open a new question. That way, people who have tried this before can find it more easily. Sorry I can't be of more help 🙂
Created 07-07-2016 02:12 PM
Spark, Flink and other frameworks most likely will be added as the community rallies behind adding those.
Connecting from NiFi to Kafka and on to Spark, Flink, or Akka is very straightforward.
Created 07-08-2016 05:58 AM
Hi @Timothy Spann,
Can you please tell me why Storm was chosen as the first framework for HDF? Is it because it offers true real-time streaming, as opposed to Spark's micro-batching?
Thanks.
Created 07-08-2016 12:34 PM
I don't know why it was picked, but Apache Storm is very production-ready, mature, and well instrumented for metrics, debugging, and running real production code. It is also true streaming, not micro-batching. Hortonworks also has committers on that project. Spark and Flink are newer and still maturing.
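The streaming vs micro-batching distinction mentioned above can be sketched roughly: a true streaming engine (Storm) handles each event the moment it arrives, while a micro-batching engine (Spark Streaming) buffers events into small batches and processes each batch at once. A toy illustration (no real engine involved; batching here is by count, whereas Spark actually slices by time interval):

```python
from typing import Callable, Iterable, List

def stream_process(events: Iterable[int], handle: Callable[[int], None]) -> None:
    # True streaming (Storm-style): each event is handled as it arrives,
    # so per-event latency is minimal.
    for event in events:
        handle(event)

def microbatch_process(events: Iterable[int],
                       handle_batch: Callable[[List[int]], None],
                       batch_size: int = 3) -> None:
    # Micro-batching (Spark Streaming-style): events wait in a buffer until
    # the batch closes, trading per-event latency for batch throughput.
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:
        handle_batch(batch)  # flush the final partial batch

per_event, batches = [], []
stream_process(range(7), per_event.append)
microbatch_process(range(7), batches.append)
print(per_event)  # → [0, 1, 2, 3, 4, 5, 6]
print(batches)    # → [[0, 1, 2], [3, 4, 5], [6]]
```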