Number of Jobs in Spark Streaming

Expert Contributor

Hi

Is there a way to determine how many jobs will eventually be created for a batch in Spark Streaming? Spark collects all the events received within a window called the batch interval. In addition, there is a block interval, which divides the batch data into blocks.

Example:

Batch interval: 5 seconds

Block interval: 1 second

This means you will have 5 blocks in a batch. Each block is processed by a task. Therefore you will have a total of 5 tasks.
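The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration, assuming a receiver-based stream in which every block interval actually receives data (so the result is an upper bound on the block count); the function name `blocks_per_batch` and the `receivers` parameter are hypothetical and not part of any Spark API.

```python
def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int, receivers: int = 1) -> int:
    """Estimate how many blocks (and so partitions) one batch contains.

    Assumption: each receiver cuts one block per block interval, and every
    interval contains data. Treat the result as an upper bound.
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0 or receivers <= 0:
        raise ValueError("intervals and receiver count must be positive")
    return (batch_interval_ms // block_interval_ms) * receivers


# Batch interval 5 s, block interval 1 s, one receiver.
print(blocks_per_batch(5000, 1000))  # prints 5
```

With one task per block, this gives 5 tasks for each stage that reads the received data directly.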

How can I find the number of jobs that will be created for a batch?

In a Spark application you have:

Jobs, which consist of a number of sequential stages, and each stage consists of a number of tasks (mentioned above).

3 REPLIES

Expert Contributor

@Marco Gaido

Thanks for your answer. I also found the answer on slide 23 of "Deep dive with Spark Streaming".

I agree, you can get the number of blocks, which represent the partitions.

total tasks per job = number of stages in the job * number of partitions
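The formula above can be written out as a small Python sketch. It assumes every stage in the job has the same number of partitions; after a shuffle the partition count can change, so the more general form sums the partitions of each stage. Both function names are hypothetical helpers for illustration only.

```python
from typing import List


def total_tasks_uniform(num_stages: int, num_partitions: int) -> int:
    """Tasks per job when every stage has the same number of partitions."""
    return num_stages * num_partitions


def total_tasks(partitions_per_stage: List[int]) -> int:
    """Tasks per job in general: one task per partition, summed over stages."""
    return sum(partitions_per_stage)


# Two stages with 5 partitions each: both forms agree.
print(total_tasks_uniform(2, 5))  # prints 10
print(total_tasks([5, 5]))        # prints 10

# Second stage repartitioned to 2 partitions by a shuffle.
print(total_tasks([5, 2]))        # prints 7
```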

I was also wondering what happens when the data rate varies considerably. Will we have uneven blocks, meaning that tasks will have uneven workloads?

I am a bit confused now. In plain Spark used for batch processing (Spark Architecture), you have jobs, where each job has stages and each stage has tasks. There, whenever you have an action, a new stage is created, so the number of stages depends on the number of actions.

A stage contains all transformations until an action (or output) is performed.

In the case of Spark Streaming, we have one job per action. How many stages will there be if each action gets a separate job?

Thanks

Expert Contributor

There is no difference between Spark and Spark Streaming in how stages, jobs and tasks are managed. In both cases you have one job per action. In both cases, as you correctly stated, jobs are made of stages, and stages are made of tasks. You have one task per partition of your RDD. The number of stages depends on the number of wide dependencies (shuffles) encountered in the lineage needed to perform a given action, not on the number of actions. The only difference is that in Spark Streaming everything is repeated for each (mini-)batch.
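The counting rules in this answer can be illustrated with a small Python model. It is a sketch under stated assumptions, not Spark code: the `Job` class and `summarize_batch` function are hypothetical, each job is assumed to have one more stage than it has wide dependencies, every stage is assumed to keep the same partition count, and stages skipped because of caching or reused shuffle output are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    action: str             # the action that triggers this job
    wide_dependencies: int  # shuffles in the lineage of this action
    partitions: int         # partitions per stage (assumed constant)


def stages_in_job(job: Job) -> int:
    """Each wide dependency ends one stage and starts another."""
    return job.wide_dependencies + 1


def summarize_batch(jobs: List[Job]) -> Dict[str, int]:
    """Count jobs, stages and tasks for one (mini-)batch."""
    stages = sum(stages_in_job(j) for j in jobs)
    tasks = sum(stages_in_job(j) * j.partitions for j in jobs)
    return {"jobs": len(jobs), "stages": stages, "tasks": tasks}


# Two actions per batch: one with no shuffle, one after a single shuffle.
batch = [
    Job(action="count", wide_dependencies=0, partitions=5),
    Job(action="save after reduceByKey", wide_dependencies=1, partitions=5),
]
print(summarize_batch(batch))  # prints {'jobs': 2, 'stages': 3, 'tasks': 15}
```

In a streaming application the same summary would apply to every batch, since the same actions run once per batch interval.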