Streaming ingest to HDFS

Super Collaborator

Hi,

I have been seeing stream-processing use cases where, as part of streaming ingest, HDFS is shown as a sink alongside HBase, Cassandra, etc.

Isn't an HDFS write supposed to involve only big files (64 MB/128 MB and up)? In Flume this is achieved through the hdfs.rollSize configuration: Flume manages the buffer until it grows big enough and then writes/flushes it out.

How is this handled when writing from Spark Streaming or Storm?

Thanks,

Avijeet

1 ACCEPTED SOLUTION

avatar
Expert Contributor

In Storm one needs to use the storm-hdfs bolt to store data in HDFS.

The bolt can be configured to flush out its results after a given number of tuples has been received (SyncPolicy).

The other relevant option for the bolt is the RotationPolicy, which defines how and when a new file should be created.

Rotation can be based on file size, time, or custom logic.

The full range of options is described here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-...
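As an illustration, here is a minimal sketch of such a bolt configuration based on the storm-hdfs API; the filesystem URL, output path, delimiter, and thresholds are placeholders, not values from this thread:

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsBoltExample {

    public static HdfsBolt buildHdfsBolt() {
        // write each tuple as a pipe-delimited text record
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");

        // SyncPolicy: flush/sync data to HDFS after every 1,000 tuples
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);

        // RotationPolicy: start a new file once the current one reaches 128 MB,
        // so HDFS ends up with a few large files instead of many small ones
        FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(128.0f, Units.MB);

        // where the output files go and how they are named (placeholder path)
        FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/streaming/storm/");

        return new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")   // placeholder NameNode URL
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}

The bolt is then wired into the topology like any other bolt, e.g. builder.setBolt("hdfs-bolt", HdfsBoltExample.buildHdfsBolt(), 2).shuffleGrouping("upstream-bolt").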


4 REPLIES


Super Collaborator

Thanks @Tibor Kiss

What is the industry practice when it comes to writing streaming data to both HDFS and another real-time store such as HBase or Cassandra?

Should we write to HDFS from the stream-processing layer (Storm, Spark Streaming)?

OR

Should we write it separately, using a separate Kafka consumer or a Flume sink?

For some reason I think writing to HDFS from the stream-processing layer doesn't sound right.

Thanks,

Avijeet

Expert Contributor

It really depends on your use-case and latency requirements.

If you need to store Storm's results in HDFS, then you can use a Storm HDFS bolt.

If you only need to store the source data, I'd suggest writing it to HDFS from Kafka or Flume. That will result in lower latency on the Storm topology and better decoupling.
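For example, a minimal sketch of a Flume agent that consumes from Kafka and writes large, rolled files to HDFS; the agent name, topic, broker, and path are placeholders, and the roll thresholds are illustrative:

# Kafka source -> memory channel -> HDFS sink
agent.sources = kafka-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-src.kafka.bootstrap.servers = broker1:9092
agent.sources.kafka-src.kafka.topics = events
agent.sources.kafka-src.channels = mem-ch

agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = /streaming/raw/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
# roll on size only (~128 MB), not on event count or time interval
agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
agent.sinks.hdfs-sink.hdfs.rollCount = 0
agent.sinks.hdfs-sink.hdfs.rollInterval = 0
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

This keeps the file-rolling concern (hdfs.rollSize, as mentioned in the original question) in the ingest layer rather than in the Storm topology.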

Super Guru
@Avijeet Dash

I agree with you. It is much more reliable if, after your streaming job, your data lands in Kafka and is then written to HBase/HDFS. This decouples your streaming job from the writes. I wouldn't recommend using Flume; go with the combination of NiFi and Kafka.