<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: streaming ingest to hdfs in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148891#M52758</link>
    <description>&lt;P&gt;In Storm one needs to use the storm-hdfs bolt to store data in HDFS.&lt;/P&gt;&lt;P&gt;The bolt can be configured to flush out the results after a given number of tuples have been received (SyncPolicy).&lt;/P&gt;&lt;P&gt;The other relevant option for the bolt is the RotationPolicy, which defines how/when a new file should be created.&lt;/P&gt;&lt;P&gt;This can be done based on file size, time, or custom logic.&lt;/P&gt;&lt;P&gt;The full range of options is described here: &lt;A href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-apis.html" target="_blank"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-apis.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 27 Jan 2017 21:57:43 GMT</pubDate>
    <dc:creator>tkiss</dc:creator>
    <dc:date>2017-01-27T21:57:43Z</dc:date>
    <item>
      <title>streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148890#M52757</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have been seeing stream-processing use cases where, as part of streaming ingest, HDFS is shown alongside HBase, Cassandra, etc.&lt;/P&gt;&lt;P&gt;Isn't HDFS supposed to be written only in big files (64 MB/128 MB+)? In Flume this is achieved with the hdfs.rollSize configuration: Flume manages the buffer until it grows large, then writes/flushes it out.&lt;/P&gt;&lt;P&gt;How is this handled when writing from Spark Streaming or Storm?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Avijeet&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:58:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148890#M52757</guid>
      <dc:creator>avijeetd</dc:creator>
      <dc:date>2022-09-16T10:58:28Z</dc:date>
    </item>
    <item>
      <title>Re: streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148891#M52758</link>
      <description>&lt;P&gt;In Storm one needs to use the storm-hdfs bolt to store data in HDFS.&lt;/P&gt;&lt;P&gt;The bolt can be configured to flush out the results after a given number of tuples have been received (SyncPolicy).&lt;/P&gt;&lt;P&gt;The other relevant option for the bolt is the RotationPolicy, which defines how/when a new file should be created.&lt;/P&gt;&lt;P&gt;This can be done based on file size, time, or custom logic.&lt;/P&gt;&lt;P&gt;The full range of options is described here: &lt;A href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-apis.html" target="_blank"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-apis.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 27 Jan 2017 21:57:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148891#M52758</guid>
      <dc:creator>tkiss</dc:creator>
      <dc:date>2017-01-27T21:57:43Z</dc:date>
    </item>
    <item>
      <title>Re: streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148892#M52759</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/2707/tkiss.html" nodeid="2707"&gt;@Tibor Kiss&lt;/A&gt;&lt;/P&gt;&lt;P&gt;What is the industry practice when it comes to writing streaming data to both HDFS and another real-time store such as HBase or Cassandra?&lt;/P&gt;&lt;P&gt;Should we write to HDFS from the stream-processing layer (Storm, Spark Streaming),&lt;/P&gt;&lt;P&gt;OR&lt;/P&gt;&lt;P&gt;should we write it separately using a separate consumer (Kafka) or sink (Flume)?&lt;/P&gt;&lt;P&gt;For some reason, writing from the stream-processing layer to HDFS doesn't sound right to me.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Avijeet&lt;/P&gt;</description>
      <pubDate>Fri, 03 Feb 2017 16:43:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148892#M52759</guid>
      <dc:creator>avijeetd</dc:creator>
      <dc:date>2017-02-03T16:43:36Z</dc:date>
    </item>
    <item>
      <title>Re: streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148893#M52760</link>
      <description>&lt;P&gt;It really depends on your use case and latency requirements.&lt;/P&gt;&lt;P&gt;If you need to store Storm's results in HDFS, then you can use a Storm HDFS bolt.&lt;/P&gt;&lt;P&gt;If you only need to store the source data, I'd suggest storing it from Kafka or Flume. That'll result in lower latency on the Storm topology and better decoupling.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Feb 2017 21:56:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148893#M52760</guid>
      <dc:creator>tkiss</dc:creator>
      <dc:date>2017-02-03T21:56:23Z</dc:date>
    </item>
    <item>
      <title>Re: streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148894#M52761</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/11016/avijeetd.html" nodeid="11016"&gt;@Avijeet Dash&lt;/A&gt;&lt;P&gt;I agree with you. It is much more reliable if after your streaming job, your data lands in Kafka and then written to HBase/HDFS. This decouples your streaming job from writing. I wouldn't recommend using Flume. Go with the combination of Nifi and Kafka.&lt;/P&gt;</description>
      <pubDate>Sat, 04 Feb 2017 00:29:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148894#M52761</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2017-02-04T00:29:48Z</dc:date>
    </item>
  </channel>
</rss>

