Created on 01-27-2017 10:00 AM - edited 09-16-2022 03:58 AM
Hi,
I have been seeing stream-processing use cases where, as part of the streaming ingest, HDFS is shown as a sink alongside HBase, Cassandra, etc.
Isn't an HDFS write supposed to involve only large files (64 MB / 128 MB and up)? In Flume this is handled by the hdfs.rollSize configuration: Flume manages the buffer until it grows large enough and then writes/flushes it out.
How is this handled when writing from Spark Streaming or Storm?
Thanks,
Avijeet
Created 01-27-2017 01:57 PM
In Storm one needs to use the storm-hdfs bolt to store data in HDFS.
The bolt can be configured to flush out the results after a given number of tuples received (SyncPolicy).
The other relevant option for the bolt is the RotationPolicy, which defines how/when a new file should be created.
This can be done based on file size, time, or custom logic.
The full range of options is described here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-...
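As an illustration only (not part of the original answer), here is a minimal sketch of wiring up the storm-hdfs HdfsBolt with a count-based SyncPolicy and a size-based RotationPolicy; the NameNode URL, output path, delimiter, and thresholds are assumed values:

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsBoltConfig {

    public static HdfsBolt buildHdfsBolt() {
        // Flush (sync) to HDFS after every 1000 tuples
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);

        // Start a new file once the current one reaches 128 MB
        FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(128.0f, Units.MB);

        // Write pipe-delimited records under an assumed output path
        FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/streaming/storm/");
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");

        return new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")   // assumed NameNode URL
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}

There is also a TimedRotationPolicy if you prefer rotating files by time rather than size.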
Created 02-03-2017 08:43 AM
Thanks @Tibor Kiss
What is the industry practice when it comes to writing streaming data to both HDFS and another real-time store such as HBase or Cassandra?
Should we write to HDFS from the stream-processing layer (Storm, Spark Streaming),
OR
should we write it separately using a separate consumer (Kafka) or sink (Flume)?
For some reason, writing to HDFS from the stream-processing layer doesn't sound right to me.
Thanks,
Avijeet
Created 02-03-2017 01:56 PM
It really depends on your use-case and latency requirements.
If you need to store Storm's results in HDFS, then you can use a Storm HDFS bolt.
If you only need to store the source data, I'd suggest writing it from Kafka or Flume. That results in lower latency in the Storm topology and better decoupling.
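For the decoupled approach, a rough sketch (my own illustration, not from this thread) of a standalone consumer that reads from Kafka and appends to HDFS, rolling files at roughly 128 MB in the same spirit as Flume's hdfs.rollSize. The broker address, topic name, and output paths are assumed:

import java.util.Collections;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToHdfsSink {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667");   // assumed broker address
        props.put("group.id", "hdfs-sink");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("events"));   // assumed topic name

        FileSystem fs = FileSystem.get(new Configuration());
        long rollSize = 128L * 1024 * 1024;                        // roll at ~128 MB, like hdfs.rollSize
        long bytesWritten = 0;
        FSDataOutputStream out = fs.create(new Path("/streaming/events-" + System.currentTimeMillis()));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                byte[] line = (record.value() + "\n").getBytes("UTF-8");
                out.write(line);
                bytesWritten += line.length;
            }
            out.hsync();                                           // flush so the data is visible in HDFS
            if (bytesWritten >= rollSize) {                        // start a new file once this one is big enough
                out.close();
                out = fs.create(new Path("/streaming/events-" + System.currentTimeMillis()));
                bytesWritten = 0;
            }
        }
    }
}

This keeps the HDFS writes (and any HDFS slowness) entirely out of the Storm topology; the topology only writes to Kafka, and the sink catches up at its own pace.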
Created 02-03-2017 04:29 PM
I agree with you. It is much more reliable if, after your streaming job, your data lands in Kafka and is then written to HBase/HDFS. This decouples your streaming job from the writing. I wouldn't recommend using Flume; go with the combination of NiFi and Kafka.