Managing Spark Streaming from Ambari?

Rising Star

This may be a simple question, but I have searched for information on it and cannot find any. I am exploring data ingestion tools that can be managed through Ambari (configured, started, stopped, restarted), and I know Flume works this way. I was hoping Kafka Connect could be managed like this, but I've seen evidence that it can't. Now I am looking at Spark Streaming and hoping there's a way to start, stop, and restart a Spark Streaming job, similar to how you work with Flume by creating custom interceptors packaged as .jar files and referencing them in the config.

Any insight would be greatly appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Managing Spark Streaming from Ambari?

Mentor

The Apache Storm view can provide some of this functionality for deployed Storm topologies, letting you start, pause, and restart them.

The best experience would be with Apache NiFi, where Ambari manages NiFi's operation and full control of data streams ("flows" in NiFi jargon) is provided out of the box. NiFi has error handling, queuing, backpressure, scheduling, expiry-of-data controls, and much more. If you are looking for a Flume replacement, NiFi is the best bet; best of all, it is decoupled from HDP and offers bidirectional flows of data, to and from Hadoop.

5 REPLIES


Re: Managing Spark Streaming from Ambari?

Rising Star

Thanks. We did check out NiFi and like it. Unfortunately, we only have the budget for one cluster, which has HDP on it. So now we have to decide whether to wipe HDP and install HDF, or stay with HDP and use a Flume-based ingestion scheme. I'm somewhat disappointed that NiFi cannot be included as part of HDP; I'm sure there are reasons. People who can only have one cluster are forced to choose between the analytics power of HDP and the data-stream control of HDF, but cannot have both.

Re: Managing Spark Streaming from Ambari?

Mentor

NiFi is decoupled from Hadoop; you can get by with just a few nodes (three) for decent throughput, and the flexibility and ease of use will pay for itself in the long run. We typically recommend keeping HDP and HDF separate.


Re: Managing Spark Streaming from Ambari?

Expert Contributor

Unfortunately, that kind of functionality does not exist for Spark Streaming. A Spark Streaming application runs as a standard YARN job, so YARN commands can be used to start, stop (kill), and re-submit it. A properly written Spark Streaming job should be able to maintain at-least-once or exactly-once semantics through this lifecycle. Beyond that, there is no UI or other automation support for it.
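To illustrate the lifecycle described above, here is a minimal command-line sketch. The job name `my-stream` and the jar name are hypothetical, and the helper assumes `yarn application -list` prints tab-separated columns with the application id first and the application name second:

```shell
# Hedged sketch: driving a Spark Streaming job's lifecycle with YARN/Spark CLIs.

# Look up the YARN application id for a running application by name.
# Assumes tab-separated `yarn application -list` output:
# Application-Id <TAB> Application-Name <TAB> ...
yarn_app_id_by_name() {
  yarn application -list -appStates RUNNING 2>/dev/null \
    | awk -v name="$1" -F'\t' '$2 == name { print $1 }'
}

# Start: submit the streaming job to YARN in cluster mode (hypothetical jar).
#   spark-submit --master yarn --deploy-mode cluster \
#     --name my-stream my-stream-assembly.jar

# Stop: kill the YARN application. A job that checkpoints properly can be
# re-submitted afterwards and resume with at-least-once/exactly-once semantics.
#   yarn application -kill "$(yarn_app_id_by_name my-stream)"
```

The restart is simply a kill followed by a fresh `spark-submit`; there is no pause/resume, which is why the answer stresses that the job itself must handle recovery.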

Zeppelin is designed for interactive analysis and running Spark streaming via Zeppelin is not recommended (other than demos for presentations).

Re: Managing Spark Streaming from Ambari?

Rising Star

Thanks for the heads up.
