<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Window Operations on Spark Streaming in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200781#M162799</link>
    <description>&lt;P&gt;Thanks. If I don't use a window and instead stream the data onto HDFS, could you suggest how to store only one week's worth of data? Should I create a cron job to delete HDFS files older than a week? Please let me know if you have any other suggestions.&lt;/P&gt;</description>
    <pubDate>Fri, 28 Jul 2017 22:23:26 GMT</pubDate>
    <dc:creator>Chandra</dc:creator>
    <dc:date>2017-07-28T22:23:26Z</dc:date>
    <item>
      <title>Window Operations on Spark Streaming</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200779#M162797</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I was just wondering if it is OK to perform window operations on DStreams with one week as the window length. Please let me know if there are any major concerns.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jul 2017 23:56:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200779#M162797</guid>
      <dc:creator>Chandra</dc:creator>
      <dc:date>2017-07-27T23:56:11Z</dc:date>
    </item>
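A week-long DStream window mainly means Spark must retain every batch RDD covered by that window. A back-of-the-envelope sketch (plain Python, illustrative intervals only) shows how many batches that is:

```python
# Back-of-the-envelope: how many batch RDDs a windowed DStream retains.
WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800

def batches_in_window(window_seconds, batch_interval_seconds):
    """Number of batch RDDs one window spans."""
    return window_seconds // batch_interval_seconds

# A 1-week window over 60-second batches keeps 10080 batch RDDs around.
for interval in (10, 60, 300):
    print(interval, "s batches ->", batches_in_window(WEEK_SECONDS, interval), "RDDs")
```

Tens of thousands of retained RDDs is the scale of the concern raised in the replies below.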
    <item>
      <title>Re: Window Operations on Spark Streaming</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200780#M162798</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11008/chandramoulimuthukumaran.html" nodeid="11008"&gt;@chandramouli muthukumaran&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Did you come across this link?&lt;/P&gt;&lt;P&gt;&lt;A href="http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-long-batch-window-duration-td10191.html" target="_blank"&gt;http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-long-batch-window-duration-td10191.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Spark keeps all the DStream RDDs in a window in memory, so you'll need RAM sized accordingly.&lt;/P&gt;&lt;P&gt;Also, DStream RDDs are replicated (factor 2 by default) to provide fault tolerance in Spark Streaming, which increases memory usage even more. &lt;A href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#background" target="_blank"&gt;https://spark.apache.org/docs/latest/streaming-programming-guide.html#background&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You may be able to disable replication, but you'll lose that fault tolerance.&lt;/P&gt;&lt;P&gt;The link above should give you a better idea.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Jul 2017 17:01:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200780#M162798</guid>
      <dc:creator>tsharma</dc:creator>
      <dc:date>2017-07-28T17:01:09Z</dc:date>
    </item>
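To make the memory concern above concrete, here is a rough estimate in plain Python. The per-batch size is an assumption for illustration, not a measurement; the 2x factor matches Spark Streaming's default replication of received data:

```python
def window_memory_gb(batches, mb_per_batch, replication=2):
    """Very rough estimate of memory needed to hold a full window.

    replication=2 matches Spark Streaming's default for received data.
    mb_per_batch is an assumption you must measure for your own stream.
    """
    return batches * mb_per_batch * replication / 1024

# 1-week window of 60 s batches (10080 of them) at ~5 MB per batch:
print(round(window_memory_gb(10080, 5), 1), "GB")  # ~98.4 GB
```

Even a modest 5 MB per minute approaches 100 GB held in memory for the window, which is why the replies steer toward storing the data externally instead.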
    <item>
      <title>Re: Window Operations on Spark Streaming</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200781#M162799</link>
      <description>&lt;P&gt;Thanks. If I don't use a window and instead stream the data onto HDFS, could you suggest how to store only one week's worth of data? Should I create a cron job to delete HDFS files older than a week? Please let me know if you have any other suggestions.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Jul 2017 22:23:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200781#M162799</guid>
      <dc:creator>Chandra</dc:creator>
      <dc:date>2017-07-28T22:23:26Z</dc:date>
    </item>
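The weekly-cleanup idea in the question above can be sketched as a cron-driven script. The filtering logic is shown in plain Python; the listing and removal steps (`hdfs dfs -ls`, `hdfs dfs -rm -r`) are only described in comments, since their invocation depends on the cluster:

```python
import time

WEEK_SECONDS = 7 * 24 * 60 * 60

def expired_paths(entries, now=None):
    """Return the paths whose modification time is more than a week old.

    `entries` is a list of (path, modification_epoch) pairs. In a real
    cron job these would be parsed from `hdfs dfs -ls` output (or the
    WebHDFS LISTSTATUS API), and each returned path removed with:
        hdfs dfs -rm -r <path>
    """
    now = time.time() if now is None else now
    cutoff = now - WEEK_SECONDS
    return [path for path, mtime in entries if mtime < cutoff]
```

Writing the stream into per-day directories makes this even simpler, since whole directories can be removed instead of individual files.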
    <item>
      <title>Re: Window Operations on Spark Streaming</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200782#M162800</link>
      <description>&lt;P&gt;Hi, I'm not entirely sure, but you could use Flume to get data into HDFS with an HDFS sink.&lt;/P&gt;&lt;P&gt;&lt;A href="https://flume.apache.org/FlumeUserGuide.html" target="_blank"&gt;https://flume.apache.org/FlumeUserGuide.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The HDFS location is specified in the flume-agent.conf file, for example:&lt;/P&gt;&lt;P&gt;agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata&lt;/P&gt;&lt;P&gt;You could write a script that changes this directory to include a timestamp and restarts the Flume agent, then run it every week through cron.&lt;/P&gt;</description>
      <pubDate>Sat, 29 Jul 2017 11:23:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200782#M162800</guid>
      <dc:creator>tsharma</dc:creator>
      <dc:date>2017-07-29T11:23:06Z</dc:date>
    </item>
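Rather than restarting the agent from a script, Flume's HDFS sink supports time-based escape sequences in `hdfs.path`, which lets each day land in its own directory. A sketch, reusing the agent and sink names from the reply above (the path itself is still the example value, not a real cluster):

```properties
# Agent/sink names follow the example config quoted in the reply above.
# %Y-%m-%d is expanded by the HDFS sink from each event's timestamp, so
# data is partitioned by day and old days can be deleted per directory.
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata/%Y-%m-%d
# Use the agent's clock if events carry no "timestamp" header:
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.useLocalTimeStamp = true
```

A weekly cron job can then remove directories older than seven days with `hdfs dfs -rm -r`, with no agent restart.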
    <item>
      <title>Re: Window Operations on Spark Streaming</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200783#M162801</link>
      <description>&lt;P&gt;Streaming the data directly to HDFS doesn't seem like it will make the data easy to find and aggregate at the end of each window. What about creating a key/value store (with Redis, HBase, or Elasticsearch, for example) and using it to look up all the keys associated with each window?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Aug 2017 12:29:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Window-Operations-on-Spark-Streaming/m-p/200783#M162801</guid>
      <dc:creator>rhardaway</dc:creator>
      <dc:date>2017-08-02T12:29:06Z</dc:date>
    </item>
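The key/value-store idea above can be sketched with a plain in-memory dict standing in for Redis or HBase; the point is that bucketing records under a window key turns end-of-window aggregation into a single key lookup (all names here are illustrative):

```python
from collections import defaultdict

WEEK_SECONDS = 7 * 24 * 60 * 60

def window_key(epoch_seconds):
    """Bucket a timestamp into its week-long window."""
    return epoch_seconds // WEEK_SECONDS

# In-memory stand-in for the store; a real deployment might use Redis
# (e.g. one sorted set per window) or HBase rows keyed by (window, key).
store = defaultdict(list)

def put(epoch_seconds, record):
    store[window_key(epoch_seconds)].append(record)

put(0, "a")
put(100, "b")
put(WEEK_SECONDS + 1, "c")
# End-of-window aggregation is now one lookup per window:
print(store[0])  # records from the first week
```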
  </channel>
</rss>

