Support Questions
Find answers, ask questions, and share your expertise

How can a Spark Streaming application start working on the next batch before completing the previous batch?

I am using Spark Streaming with Kafka. The Spark version is 1.5.0, the Cloudera version is 5.5.0, and my batch interval is 1 second.

In my scenario, the algorithm takes 7-10 seconds to process one batch interval's data, so the Spark Streaming application only starts processing the next batch after the previous one has completed.

I want my Spark Streaming application to start working on the next batch before it has completed the previous one, i.e. batches should execute in parallel.

Please help me solve this problem.

Regards,
Prateek

1 REPLY

Re: How can a Spark Streaming application start working on the next batch before completing the previous batch?

Master Guru

You can indeed pass work onto a different (async) thread from your streaming batch RDDs. However, make sure to use the StreamingContext.remember(…) method to ensure the DStream keeps the data around for at least as long as it takes for your processing to complete over the batch.

 

Some aspects of this are also discussed in Spark's streaming programming guide (in an SQL context, but it generalises to your use case in the same way):

 

"""
You can also run SQL queries on tables defined on streaming data from a different thread (that is, asynchronous to the running StreamingContext). Just make sure that you set the StreamingContext to remember a sufficient amount of streaming data such that the query can run. Otherwise the StreamingContext, which is unaware of the asynchronous SQL queries, will delete off old streaming data before the query can complete. For example, if you want to query the last batch, but your query can take 5 minutes to run, then call streamingContext.remember(Minutes(5)) (in Scala, or equivalent in other languages).

""" - https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
