Why does HDF come with Storm and not Spark?

Contributor

HDF ships NiFi, Kafka, ZooKeeper, and Storm as default components. Why not Spark? I understand that Spark is not meant for true streaming and is better suited to micro-batching, but are there any additional reasons? Thanks.

1 ACCEPTED SOLUTION

Master Guru

There is a Spark Streaming connector available, but if it's not in the installation, it's not supported (yet).

https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark

In the end it's a question of priorities. Most of the time you would go NiFi -> Kafka -> Storm/Spark anyway, so that Kafka gives you a proper, scalable big-data buffer.
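As a rough sketch, the Spark end of that NiFi -> Kafka -> Spark chain could look like the following. The topic name, broker address, and app name are all made up for illustration, and running it requires a Spark installation plus the matching spark-streaming-kafka package:

```python
# Hedged sketch of the Spark side of the NiFi -> Kafka -> Spark pattern.
# Assumes a NiFi PublishKafka processor writes to a topic named "events"
# and a Kafka broker is reachable at broker:6667 (both hypothetical).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="nifi-kafka-spark")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Direct (receiver-less) Kafka stream; offsets are tracked by Spark.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker:6667"})

# Each record is a (key, value) pair; count records per micro-batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```

You would submit this with spark-submit; the point is only that Kafka decouples NiFi's ingest rate from whatever processing engine sits behind it.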


7 REPLIES


Contributor

Thank you very much for the reply. If it's a case of priorities, then I understand. 🙂

Sure, I will post the ML question as a separate one.

Contributor

Hi @Benjamin Leonhardi,

Thanks for replying. You are right, there is a connector for receiving data from NiFi in Spark. I tried it by transferring data to Spark through an output port, and it works pretty well 🙂

My question is more about why Hortonworks decided not to include it.

I am also curious about the machine-learning integration. Please share if there are any good ways to accept requests from NiFi, make predictions using a trained classifier, and finally return the scores as a response. NiFi can accept requests and send back a response, but I have not been able to pass the data to a classifier and get the scores back.
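Here is roughly the shape I have in mind: a small HTTP scoring service that NiFi's InvokeHTTP (or HandleHttpRequest/HandleHttpResponse) processors could call. The score() function below is a hypothetical placeholder rule, not a real trained model:

```python
# Minimal sketch of an HTTP scoring service that NiFi could call.
# The "classifier" is a stand-in threshold rule; in practice you would
# load a real trained model (e.g. from a pickle file) at startup.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def score(features):
    """Hypothetical classifier: returns a probability-like score in [0, 1]."""
    # Placeholder logic: normalized sum of feature values, capped at 1.0.
    total = sum(features.values())
    return min(max(total / 100.0, 0.0), 1.0)

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature map NiFi POSTs as the flowfile content.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        # Return the score as JSON; NiFi receives it as the response body.
        body = json.dumps({"score": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("0.0.0.0", 8081), ScoreHandler).serve_forever()
```

The idea would be InvokeHTTP POSTing the flowfile content to this endpoint and writing the JSON response back into the flow, but I have not gotten this working end to end yet.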

Thanks again.

Master Guru

As said, it's a question of priorities. Connectors normally move from a community project into the actual product. Once they are in, they have to be tested and supported, and they often need changes such as Kerberos support as well. So including something is not zero effort; it is just lower on the priority list than the Kafka connection, which is the standard approach for ingesting real-time data into a big-data environment. If you need the integration, you could post an idea on the community board; our product management teams read them.

Regarding the machine-learning inquiry, please open a new question. That would let people who have tried this before find it more easily. Sorry for not being of more help 🙂

Master Guru

Spark, Flink, and other frameworks will most likely be added as the community rallies behind them.

Connecting from NiFi to Kafka and on to Spark, Flink, or Akka is very straightforward.

Contributor

Hi @Timothy Spann,

Can you please tell me why Storm was chosen as the first framework for HDF? Is it because it offers true real-time streaming, as opposed to Spark's micro-batching?

Thanks.

Master Guru

I don't know why it was picked, but Apache Storm is production-ready, mature, and very well instrumented for metrics, debugging, and running real production code. It is also true streaming, not micro-batching. Hortonworks also has committers on that project. Spark and Flink are newer and still maturing.