Streaming ingest to HDFS
Created on 01-27-2017 10:00 AM - edited 09-16-2022 03:58 AM
Hi,
I have been seeing stream-processing use cases where, as part of the streaming ingest, HDFS is shown alongside HBase, Cassandra, etc.
Isn't writing to HDFS supposed to be done only with big files (64 MB/128 MB and up)? In Flume this is achieved with the hdfs.rollSize configuration: Flume buffers the data until it grows big enough, then writes/flushes it out.
How is this handled when writing from Spark Streaming or Storm?
Thanks,
Avijeet
Created 01-27-2017 01:57 PM
In Storm one needs to use the storm-hdfs bolt (HdfsBolt) to store data in HDFS.
The bolt can be configured to flush out its results after a given number of tuples has been received (SyncPolicy).
The other relevant option for the bolt is the RotationPolicy, which defines how/when a new file should be created.
This can be based on file size, time, or custom logic.
The full range of options is described here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-...
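For illustration, a minimal HdfsBolt configuration along these lines might look like the sketch below (based on the storm-hdfs documentation; the NameNode URL, output path, delimiter, and thresholds are placeholder assumptions, not values from this thread):

```java
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsBoltFactory {

    public static HdfsBolt buildHdfsBolt() {
        // SyncPolicy: flush/sync the written data to HDFS every 1000 tuples
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);

        // RotationPolicy: start a new file once the current one reaches 128 MB,
        // so the files landing in HDFS are block-sized rather than tiny
        FileRotationPolicy rotationPolicy =
                new FileSizeRotationPolicy(128.0f, Units.MB);

        // How each tuple is serialized into the output file
        RecordFormat recordFormat =
                new DelimitedRecordFormat().withFieldDelimiter("|");

        // Where the files are written (placeholder path)
        FileNameFormat fileNameFormat =
                new DefaultFileNameFormat().withPath("/data/streaming/");

        return new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020") // placeholder NameNode URL
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(recordFormat)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}
```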
Created 02-03-2017 08:43 AM
Thanks @Tibor Kiss
What is the industry practice when it comes to writing streaming data to both HDFS and another real-time store such as HBase or Cassandra?
Should we write to HDFS from the stream-processing layer (Storm, Spark Streaming),
OR
should we write it separately using a separate consumer (Kafka) or sink (Flume)?
For some reason, writing to HDFS from the stream-processing layer doesn't sound right to me.
Thanks,
Avijeet
Created 02-03-2017 01:56 PM
It really depends on your use case and latency requirements.
If you need to store Storm's results in HDFS, then you can use a Storm HDFS bolt.
If you only need to store the source data, I'd suggest storing it from Kafka or Flume. That'll result in lower latency on the Storm topology and better decoupling.
Created 02-03-2017 04:29 PM
I agree with you. It is much more reliable if, after your streaming job, your data lands in Kafka and is then written to HBase/HDFS. This decouples your streaming job from the writing. I wouldn't recommend using Flume; go with the combination of NiFi and Kafka.
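As a rough sketch of that decoupling pattern on the producing side (the broker address, topic name, and payload below are made-up placeholders): the streaming job only publishes its results to Kafka, and a separate consumer such as NiFi or a dedicated HDFS/HBase sink handles the actual writes.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ResultPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:6667"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for full acknowledgement for reliability

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // In a real streaming job this would be called once per processed event;
            // downstream, NiFi or another consumer moves the topic's data to HDFS/HBase.
            producer.send(new ProducerRecord<>("enriched-events", "event-key", "{\"value\":42}"));
        }
    }
}
```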
