Created 01-24-2017 07:45 PM
This may be a simple question, but I have searched for information on it and cannot find any. I am exploring various data ingestion tools that can be managed through Ambari (configured, started, stopped, restarted), and I know Flume works this way. I was hoping Kafka Connect could be managed like this, but I've seen evidence that it can't. Now I am looking at Spark Streaming and hoping there's a way to start, stop, and restart a Spark Streaming job, kind of like you do with Flume by creating custom interceptors as .jar files and referencing them in the config. Is that possible?
Any insight would be greatly appreciated.
Created 01-29-2017 11:42 AM
The Apache Storm view can provide some of this functionality for deployed Storm topologies (start/pause/restart them).
The best experience would be with Apache NiFi, where Ambari manages NiFi operation and full control of data streams (flows, in NiFi jargon) is provided out of the box. NiFi has error handling, queuing, backpressure, scheduling, data expiry controls, and much more. If you are looking for a Flume replacement, NiFi is your best bet; best of all, it is decoupled from HDP and offers bidirectional flows of data, to and from Hadoop.
Created 01-30-2017 02:35 PM
Thanks. We did check out NiFi and like it. Unfortunately, we only have the budget for one cluster, which has HDP on it. So now we have to decide whether to wipe HDP and install HDF, or stay with HDP and use a Flume ingestion scheme. I'm kind of disappointed that NiFi cannot be included as part of HDP; I'm sure there are reasons. People who can only have one cluster are forced to choose between the power of the analytics in HDP and the power of the data flow control in HDF, but cannot have both.
Created 01-30-2017 03:39 PM
NiFi is decoupled from Hadoop; you can get by with just a few nodes (3) to get decent throughput, and the flexibility and ease of use will pay for itself in the long run. We typically recommend separating HDP and HDF.
Created 01-30-2017 03:49 AM
Unfortunately, that kind of functionality does not exist for Spark Streaming. A Spark Streaming application runs as a standard YARN job, so YARN commands can be used to start, stop (kill), and re-submit it. A properly written Spark Streaming job should be able to support at-least-once or exactly-once semantics through this lifecycle, but other than that there is no UI or other automation support for it. A minimal sketch of such a job is included below.
Zeppelin is designed for interactive analysis, and running Spark Streaming via Zeppelin is not recommended (other than for demos in presentations).
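For reference, here is a minimal sketch (in Scala) of a Spark Streaming job written to survive that stop/re-submit lifecycle. The application name, checkpoint directory, and socket source are placeholders for illustration, not anything from this thread. The idea is to submit it with spark-submit --master yarn, stop it with yarn application -kill <applicationId>, and re-submit it; on restart, StreamingContext.getOrCreate recovers progress from the checkpoint directory, which is what gives at-least-once behavior.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RestartableStreamingJob {
  // Hypothetical checkpoint location; any durable HDFS path works.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("RestartableStreamingJob")
      // Finish in-flight batches before shutting down when the job is stopped.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Placeholder source and action; swap in the real ingestion pipeline here.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On re-submit, state and progress are recovered from the checkpoint,
    // which is what provides at-least-once semantics across stop/start.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}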
Created 01-30-2017 02:35 PM
Thanks for the heads up.