
Spark Streaming

In Spark Streaming, how do I load data received in different events into a collection? Let's say 1,000 records are streamed in event 1 and another 1,000 records are streamed in event 2. At the end of event 2, I want the data from both events (1,000 + 1,000).

I tried an accumulable collection in Spark to accumulate the data streamed across the events, but it did not work. Please help.

2 REPLIES

Hi @Balakumar Balasundaram

Sounds like you want to join two DStreams? See this tutorial for a good example: https://docs.cloud.databricks.com/docs/latest/databricks_guide/07%20Spark%20Streaming/13%20Joining%2...
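In case the link changes, here is a minimal sketch of joining two pair DStreams. The stream names, host, and ports are assumptions for illustration, not taken from the tutorial:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamJoinExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical sources; replace with your real event streams
val stream1 = ssc.socketTextStream("localhost", 9001).map(line => (line, 1))
val stream2 = ssc.socketTextStream("localhost", 9002).map(line => (line, 1))

// Within each batch, join records whose keys match across the two streams
val joined = stream1.join(stream2) // DStream[(String, (Int, Int))]
joined.print()

ssc.start()
ssc.awaitTermination()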


Spark Streaming provides windowed computations, which let you apply transformations over a sliding window of data. Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated on to produce the RDDs of the windowed DStream. So in your case, you can define a window that covers event 1 and event 2, which gives you a combined RDD you can save or work with.

Here is a sample command in Scala, taken from the Spark Streaming Programming Guide's Window Operations section.

// Reduce the last 30 seconds of data, every 10 seconds
// (`pairs` is a DStream[(String, Int)]; Seconds comes from org.apache.spark.streaming)
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
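Applied to the scenario in the question, a minimal self-contained sketch might look like the following. The source, host, port, and batch interval are assumptions; the point is that a 20-second window over 10-second batches makes each windowed RDD hold two consecutive events' worth of records:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical source; each 10-second batch corresponds to one "event"
val records = ssc.socketTextStream("localhost", 9999)

// A 20-second window sliding every 10 seconds: each windowed RDD contains
// the current event's records plus the previous event's (1000 + 1000)
val combined = records.window(Seconds(20), Seconds(10))
combined.foreachRDD(rdd => println(s"records in window: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()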