Managing Spark Streaming from Ambari?

Expert Contributor

This may be a simple question, but I have searched for information on it and cannot find any. I am exploring data ingestion tools that can be managed through Ambari (configured, started, stopped, restarted). I know Flume works this way, and I was hoping Kafka Connect could too, but I've seen evidence that it can't. Now I am looking at Spark Streaming and hoping there's a way to start, stop, and restart a Spark Streaming job, something like the way you extend Flume by writing custom interceptors as .jar files and referencing them in the config. Is that possible?

Any insight would be greatly appreciated.

1 ACCEPTED SOLUTION

Master Mentor

The Apache Storm view can provide some of this functionality for deployed Storm topologies, letting you start, pause, and restart them.

The best experience would be with Apache NiFi, where Ambari manages NiFi operations and full control of data streams (flows, in NiFi jargon) is provided out of the box. NiFi has error handling, queuing, backpressure, scheduling, data-expiry controls, and much more. If you are looking for a Flume replacement, NiFi is the best bet; best of all, it is decoupled from HDP and offers bidirectional flows of data to and from Hadoop.
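
(For reference, the same topology lifecycle the Storm view exposes can also be driven from the storm command line. A minimal sketch follows; the jar path, main class, and topology name are made up for illustration.)

    # Deploy a topology (jar, class, and name are hypothetical)
    storm jar my-topology.jar com.example.MyTopology my-topology

    # Pause it: spouts stop emitting new tuples
    storm deactivate my-topology

    # Resume it
    storm activate my-topology

    # Stop it and remove it from the cluster
    storm kill my-topology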


5 REPLIES


Expert Contributor

Thanks. We did check out NiFi and like it. Unfortunately, we only have the budget for one cluster, which has HDP on it. So now we have to decide whether to wipe HDP and install HDF, or stay with HDP and use a Flume-based ingestion scheme. I'm a little disappointed that NiFi cannot be included as part of HDP; I'm sure there are reasons. People who can only have one cluster are forced to choose between the power of the analytics in HDP and the power of the data-stream control of HDF, but they cannot have both.

Master Mentor

NiFi is decoupled from Hadoop; you can get by with just a few nodes (three) and still get decent throughput, and the flexibility and ease of use will pay for themselves in the long run. We typically recommend keeping HDP and HDF on separate clusters.

Super Collaborator

Unfortunately, that kind of functionality does not exist for Spark Streaming. A Spark Streaming job runs as a standard YARN application, so YARN commands can be used to start it, stop (kill) it, and re-submit it. A properly written Spark Streaming job should be able to support at-least-once or exactly-once semantics through this lifecycle, but beyond that there is no UI or other automation support for it.
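
For anyone looking for the concrete commands, a minimal sketch of that lifecycle from the shell; the class name, jar, and application ID below are made up for illustration:

    # Submit the streaming job to YARN; cluster mode keeps the driver
    # running on the cluster after the client disconnects
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyStreamingApp my-streaming-app.jar

    # Find the application ID of the running job
    yarn application -list -appStates RUNNING

    # "Stop" means kill -- YARN has no graceful stop for a generic application
    yarn application -kill application_1473186420123_0042

    # "Restart" is simply re-running the same spark-submit command

Whether a restart loses data depends on the job itself: if it builds its StreamingContext with StreamingContext.getOrCreate against a reliable checkpoint directory (e.g. on HDFS), it can recover offsets and state after the kill, which is what makes at-least-once semantics achievable through this kill-and-resubmit cycle.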

Zeppelin is designed for interactive analysis; running Spark Streaming via Zeppelin is not recommended (other than for demos and presentations).

Expert Contributor

Thanks for the heads up.