
Window Operations on Spark Streaming

Solved

Window Operations on Spark Streaming

Contributor

Hi,

I was just wondering if it is OK to perform window operations on DStreams with one week as the window length. Please let me know if there are any major concerns.

Thanks

1 ACCEPTED SOLUTION

Re: Window Operations on Spark Streaming

Expert Contributor

@chandramouli muthukumaran

Did you come across this link?

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-long-batch-window-duration-...

Spark keeps all the data in a windowed DStream in memory for the duration of the window, so a one-week window means you'll need enough RAM across your executors to hold a full week's worth of received data.

Also, DStream RDDs are replicated (factor 2 by default) to provide fault tolerance in Spark Streaming, which increases the memory requirement even further. https://spark.apache.org/docs/latest/streaming-programming-guide.html#background

You may be able to lower the replication factor, but then you lose that fault tolerance.

The link above should give you a better idea.
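A quick back-of-the-envelope calculation makes the memory concern concrete. The ingest rate below is an assumed figure for illustration only; plug in your own:

```python
# Back-of-envelope estimate of the memory a 1-week window needs.
# The ingest rate is an assumption for illustration, not a measurement.
ingest_mb_per_sec = 0.5          # assumed average receive rate
replication = 2                  # Spark Streaming's default for received data
window_sec = 7 * 24 * 60 * 60    # 1-week window length in seconds

window_gb = ingest_mb_per_sec * window_sec * replication / 1024
print(f"~{window_gb:.0f} GB must stay cached for the window")
```

Even at a modest half-megabyte per second, a one-week window with default replication works out to hundreds of gigabytes held in memory, which is why such long windows are usually avoided.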


4 REPLIES


Re: Window Operations on Spark Streaming

Contributor

Thanks. If I don't use a window and instead stream the data to HDFS, could you suggest how to store only one week's worth of data? Should I create a cron job to delete HDFS files older than a week? Please let me know if you have any other suggestions.


Re: Window Operations on Spark Streaming

Expert Contributor

Hi, I'm not completely sure, but you could use Flume to get data into HDFS by using an HDFS sink.

https://flume.apache.org/FlumeUserGuide.html

The location in HDFS is specified in the flume-agent.conf file, for example:

agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata

You could write a script that rewrites this directory with a timestamp and restarts the Flume agent, and then run it every week through cron.
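A sketch of the weekly cleanup side of that idea: if the sink writes into date-stamped directories (the `/flume/webdata/YYYY-MM-DD` layout below is a hypothetical example, not from the thread), a small script can pick out the directories older than a week. This version only prints the `hdfs dfs` commands rather than running them:

```python
# Sketch of a weekly cleanup for date-stamped HDFS directories, assuming
# Flume writes under a hypothetical /flume/webdata/YYYY-MM-DD layout.
# It prints the `hdfs dfs` commands; a real cron job would execute them.
from datetime import date, timedelta

def dirs_to_delete(dir_names, today, keep_days=7):
    """Return the date-stamped directory names older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    return [d for d in dir_names if date.fromisoformat(d) < cutoff]

if __name__ == "__main__":
    listing = ["2016-03-01", "2016-03-05", "2016-03-10"]  # e.g. parsed from `hdfs dfs -ls`
    for d in dirs_to_delete(listing, today=date(2016, 3, 10)):
        print(f"hdfs dfs -rm -r -skipTrash /flume/webdata/{d}")
```

Keeping the deletion logic separate from the actual `hdfs dfs -rm` call makes it easy to dry-run the script before wiring it into cron.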

Re: Window Operations on Spark Streaming

Explorer

It doesn't seem like streaming data directly to HDFS will make it easy to find and aggregate the data at the end of each window. What about creating a key/value store (with Redis, HBase, or Elasticsearch, for example) and using it to look up all the keys associated with each window?
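A minimal sketch of that key/value idea, using an in-memory dict as a stand-in for Redis/HBase/Elasticsearch: bucket each event's key under the weekly window it falls into, so "all keys for a window" becomes a single lookup.

```python
# In-memory sketch of indexing event keys by weekly window.
# A defaultdict stands in for the real store (Redis/HBase/etc.).
from collections import defaultdict

WINDOW_SEC = 7 * 24 * 60 * 60  # one-week window

def window_start(ts):
    """Align a Unix timestamp to the start of its weekly window."""
    return ts - (ts % WINDOW_SEC)

index = defaultdict(set)

def record(ts, key):
    """Register that `key` was seen at time `ts`."""
    index[window_start(ts)].add(key)

record(100, "user-a")
record(200, "user-b")
record(100 + WINDOW_SEC, "user-c")       # lands in the next window

print(sorted(index[window_start(150)]))  # keys for the first window
```

With a real store, `window_start(ts)` would just become part of the row key or Redis key, so expiring old windows is also a single delete per week.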