<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: streaming ingest to hdfs in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148891#M52758</link>
    <description>&lt;P&gt;In Storm one needs to use the storm-hdfs bolt to store data in HDFS.&lt;/P&gt;&lt;P&gt;The bolt can be configured to flush out the results after a given number of tuples have been received (SyncPolicy).&lt;/P&gt;&lt;P&gt;The other relevant option for the bolt is the RotationPolicy, which defines how/when a new file should be created.&lt;/P&gt;&lt;P&gt;This can be done based on file size, time, or custom logic.&lt;/P&gt;&lt;P&gt;The full range of options is described here: &lt;A href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-apis.html" target="_blank"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-apis.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 27 Jan 2017 21:57:43 GMT</pubDate>
    <dc:creator>tkiss</dc:creator>
    <dc:date>2017-01-27T21:57:43Z</dc:date>
    <item>
      <title>streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148890#M52757</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have been seeing stream-processing use cases where, as part of streaming ingest, HDFS is shown alongside HBase, Cassandra, etc.&lt;/P&gt;&lt;P&gt;Isn't HDFS supposed to be written only in big files (64 MB/128 MB+)? In Flume this is achieved with the hdfs.rollSize configuration: Flume manages the buffer until it grows large, then writes/flushes it out.&lt;/P&gt;&lt;P&gt;How is this handled when writing from Spark Streaming or Storm?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Avijeet&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:58:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148890#M52757</guid>
      <dc:creator>avijeetd</dc:creator>
      <dc:date>2022-09-16T10:58:28Z</dc:date>
    </item>
    <item>
      <title>Re: streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148891#M52758</link>
      <description>&lt;P&gt;In Storm one needs to use the storm-hdfs bolt to store data in HDFS.&lt;/P&gt;&lt;P&gt;The bolt can be configured to flush out the results after a given number of tuples have been received (SyncPolicy).&lt;/P&gt;&lt;P&gt;The other relevant option for the bolt is the RotationPolicy, which defines how/when a new file should be created.&lt;/P&gt;&lt;P&gt;This can be done based on file size, time, or custom logic.&lt;/P&gt;&lt;P&gt;The full range of options is described here: &lt;A href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-apis.html" target="_blank"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_storm-component-guide/content/storm-hdfs-apis.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 27 Jan 2017 21:57:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148891#M52758</guid>
      <dc:creator>tkiss</dc:creator>
      <dc:date>2017-01-27T21:57:43Z</dc:date>
    </item>
    <item>
      <title>Re: streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148892#M52759</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/2707/tkiss.html" nodeid="2707"&gt;@Tibor Kiss&lt;/A&gt;&lt;/P&gt;&lt;P&gt;What is the industry practice when it comes to writing streaming data to both HDFS and another real-time store such as HBase or Cassandra?&lt;/P&gt;&lt;P&gt;Should we write to HDFS from the stream-processing layer (Storm, Spark Streaming),&lt;/P&gt;&lt;P&gt;OR&lt;/P&gt;&lt;P&gt;should we write it separately using a separate consumer (Kafka) or sink (Flume)?&lt;/P&gt;&lt;P&gt;For some reason, writing from the stream-processing layer to HDFS doesn't sound right to me.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Avijeet&lt;/P&gt;</description>
      <pubDate>Fri, 03 Feb 2017 16:43:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148892#M52759</guid>
      <dc:creator>avijeetd</dc:creator>
      <dc:date>2017-02-03T16:43:36Z</dc:date>
    </item>
    <item>
      <title>Re: streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148893#M52760</link>
      <description>&lt;P&gt;It really depends on your use case and latency requirements.&lt;/P&gt;&lt;P&gt;If you need to store Storm's results in HDFS, then you can use a Storm HDFS bolt.&lt;/P&gt;&lt;P&gt;If you only need to store the source data, I'd suggest storing it from Kafka or Flume. That'll result in lower latency on the Storm topology and better decoupling.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Feb 2017 21:56:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148893#M52760</guid>
      <dc:creator>tkiss</dc:creator>
      <dc:date>2017-02-03T21:56:23Z</dc:date>
    </item>
    <item>
      <title>Re: streaming ingest to hdfs</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148894#M52761</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/11016/avijeetd.html" nodeid="11016"&gt;@Avijeet Dash&lt;/A&gt;&lt;P&gt;I agree with you. It is much more reliable if after your streaming job, your data lands in Kafka and then written to HBase/HDFS. This decouples your streaming job from writing. I wouldn't recommend using Flume. Go with the combination of Nifi and Kafka.&lt;/P&gt;</description>
      <pubDate>Sat, 04 Feb 2017 00:29:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/streaming-ingest-to-hdfs/m-p/148894#M52761</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2017-02-04T00:29:48Z</dc:date>
    </item>
  </channel>
</rss>

