Explorer
Posts: 28
Registered: ‎03-25-2017

Why Cloudera dont have Storm as a part of CDH release yet?

Hi,

I would like to know why Cloudera does not ship Storm as part of CDH, whereas Hortonworks does. Does Cloudera have any other real-time processing component that can replace it?

 

Thanks 

Sidharth

Cloudera Employee
Posts: 463
Registered: ‎08-11-2014

Re: Why Cloudera dont have Storm as a part of CDH release yet?

Spark Streaming fills this role in the CDH distribution. I think Storm has been superseded by things like Heron anyway.

Explorer
Posts: 28
Registered: ‎03-25-2017

Re: Why Cloudera dont have Storm as a part of CDH release yet?

Thanks for your response. But Spark Streaming is still batch processing: it pulls data in batches and executes them. Spark Streaming also still has issues and is not as stable as Flume in production. Please correct me if I am wrong.

Cloudera Employee
Posts: 463
Registered: ‎08-11-2014

Re: Why Cloudera dont have Storm as a part of CDH release yet?

I am not sure what you're referring to. SS has been production-supported for a couple of years. Spark Streaming is micro-batch, but almost all real use cases are micro-batch. If Kafka is upstream or you're doing any kind of CEP, you already have micro-batch on your hands. You can run Storm yourself if you want to, but I'd ask yourself what you're trying to accomplish by that.
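
To make the micro-batch idea concrete, here is a minimal sketch in plain Python (not the Spark API) of how events get grouped into fixed time intervals before being processed together. The helper name `micro_batches`, the 500 ms interval, and the sample events are all made up for illustration.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, payload) events into fixed-width time windows.

    Hypothetical helper for illustration only; not part of Spark.
    Returns a list of (window_start, [payloads]) sorted by window_start.
    """
    windows = defaultdict(list)
    for ts, payload in events:
        # Integer division maps each timestamp to the start of its window.
        window_start = (ts // interval) * interval
        windows[window_start].append(payload)
    return sorted(windows.items())


# Made-up events: (timestamp in milliseconds, payload)
events = [(5, "a"), (120, "b"), (480, "c"), (510, "d"), (990, "e"), (1001, "f")]

batches = micro_batches(events, interval=500)
for start, payloads in batches:
    print(f"batch starting at {start} ms: {payloads}")
```

In this sketch an event waits at most one interval before its batch is handed to the processing step, so the interval length is roughly the extra latency that batching adds compared with handling each event on arrival.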

Explorer
Posts: 28
Registered: ‎03-25-2017

Re: Why Cloudera dont have Storm as a part of CDH release yet?

Hi,

Yes, as you said, SS is micro-batch processing, so we cannot call it actual
real-time processing. As an example, suppose I am hosting an application
with millions of users logging in, and I have to trace intruders or
unauthorized activity on the fly and stop it. We cannot rely on batch or
even micro-batch processing for that. It has to be true per-event real-time
processing. So, is there any component provided by Cloudera that can do
real-time processing like Storm or Heron?

In the past I worked on a system where Flume ran smoothly in production,
storing raw protobuf data in HBase, which was then processed by a MapReduce
job. Because the job took a long time to complete, the stakeholders decided
to move to Spark.
In the first attempt, the developers added processing logic to transform the
raw protobuf data into a readable form and store it in HDFS as Parquet
files using Spark Streaming. We applied multiple suggestions and settings
to make it run, but it never survived production load because of back
pressure, even after we gave it 3 times more memory than Flume.
In the second attempt, the transformation logic was removed and the job only
stored the raw protobuf data in Parquet files, but it still could not
perform like Flume. It always had pending batches in the queue, which made
it fail every time, and at last we had to give up on Spark because of its
inability to handle back pressure.
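
The "pending batches in queue" symptom can be sketched with a small model in plain Python. This is only an illustration under assumed numbers, not a reproduction of the job described above: the function names, the load figures, and the capacity of 1000 events per interval are all made up.

```python
def pending_backlog(arrivals_per_interval, capacity_per_interval):
    """Backlog of unprocessed events after each interval, with no throttling.

    Illustrative model: each interval, new events arrive and at most
    `capacity_per_interval` of the waiting events get processed.
    """
    backlog = 0
    history = []
    for arrived in arrivals_per_interval:
        backlog += arrived
        backlog -= min(backlog, capacity_per_interval)
        history.append(backlog)
    return history


def rate_limited_intake(arrivals_per_interval, capacity_per_interval):
    """Same load, but the job admits at most `capacity_per_interval` events
    per interval; the remainder waits upstream (for example in a broker)."""
    upstream = 0
    admitted_history = []
    for arrived in arrivals_per_interval:
        upstream += arrived
        admitted = min(upstream, capacity_per_interval)
        upstream -= admitted
        admitted_history.append(admitted)
    return admitted_history, upstream


arrivals = [800, 1200, 1500, 1500, 600, 200]  # made-up events per interval

backlog = pending_backlog(arrivals, capacity_per_interval=1000)
admitted, left_upstream = rate_limited_intake(arrivals, capacity_per_interval=1000)

print("backlog inside the job per interval:", backlog)
print("events admitted per interval:", admitted)
print("events still waiting upstream:", left_upstream)
```

In the first function the backlog grows inside the job whenever arrivals exceed capacity; in the second the job never takes in more than it can process and the surplus stays in the upstream buffer. Spark Streaming has a setting, `spark.streaming.backpressure.enabled`, that is intended to throttle intake along these lines; whether it would have helped in the case above is not something this sketch can show.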


Thanks
Sidharth
Cloudera Employee
Posts: 463
Registered: ‎08-11-2014

Re: Why Cloudera dont have Storm as a part of CDH release yet?

No, I think you are confusing event at a time processing with real time. SS is real time. Of course, 'real time' is also relative. If you are working in the realm of milliseconds, you need a synchronous request/reply API anyway and any streaming system is probably the wrong choice.

As I say, you can run Storm on CDH if you want. But even taking your example, I strongly doubt Storm is the right architecture. You either need some CEP, or really need an API to call.

I'm not sure what problem you had writing a Spark app, but lots of people use it successfully for streaming. Nothing about that sounds like SS was the issue. How would you write protobuf files if you see one event at a time?

Flume is not a stream processing system. It is for getting logs into the cluster with modest ETL. If that worked for you, because that's what you're doing, then why not use Flume?
Explorer
Posts: 28
Registered: ‎03-25-2017

Re: Why Cloudera dont have Storm as a part of CDH release yet?

Also, I would like to know: if I install Storm on my existing Cloudera
cluster, will I be able to monitor it, and if so, how?
Champion
Posts: 565
Registered: ‎05-16-2016

Re: Why Cloudera dont have Storm as a part of CDH release yet?

Would you consider Apache NiFi to push the data? It supports backpressure, it comes with a bunch of processors, and you can write your own custom processor.

We use NiFi to push data to HDFS, then Spark for parsing, then the Hive metastore, and we run queries using Impala and Kudu.

Explorer
Posts: 28
Registered: ‎03-25-2017

Re: Why Cloudera dont have Storm as a part of CDH release yet?

Thanks. The requirement is that processing of every single event must be
guaranteed at an input rate of 1000+ events per second. The load can also
be bursty: sometimes we get a single event in a second, and then suddenly
it jumps to 1000+ events per second. So what would you say about a
NiFi -> Kafka -> Storm combination? We cannot use Spark Streaming in this
case because, even with checkpointing implemented, if the whole job fails
while Spark Streaming is processing a batch, the data that was being
processed may be lost. So we chose Kafka to guarantee that no data loss
occurs.
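
The no-data-loss idea behind putting a replayable log in front of the processor can be sketched in plain Python. This is a toy model, not the Kafka client API: `InMemoryLog`, `run_consumer`, the batch size, and the simulated crash are all hypothetical. The point it illustrates is committing the offset only after a batch has been fully handled, so a crash mid-batch causes a replay rather than a loss.

```python
class InMemoryLog:
    """Tiny stand-in for a replayable log partition (hypothetical)."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # offset of the next record to read after a restart


def run_consumer(log, handle, batch_size, fail_at_offset=None):
    """Process records from the last committed offset onward.

    The offset is committed only after a whole batch has been handled,
    which gives at-least-once delivery in this model.
    """
    offset = log.committed
    while offset < len(log.records):
        batch = log.records[offset:offset + batch_size]
        for i, record in enumerate(batch):
            if fail_at_offset is not None and offset + i == fail_at_offset:
                raise RuntimeError(f"simulated crash at offset {offset + i}")
            handle(record)
        offset += len(batch)
        log.committed = offset  # commit after the batch succeeded


log = InMemoryLog(range(10))
processed = []

# First run crashes in the middle of the second batch.
try:
    run_consumer(log, processed.append, batch_size=4, fail_at_offset=6)
except RuntimeError as err:
    print("job failed:", err)

# Restart: resume from the last committed offset, replaying the failed batch.
run_consumer(log, processed.append, batch_size=4)

print("processed:", processed)
print("every record seen at least once:", set(processed) == set(range(10)))
```

Note that in this model the records handled just before the crash are processed twice after the restart, so the handler has to tolerate duplicates (for example by being idempotent) if each event must take effect exactly once.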


Thanks
Sidharth