Number of Jobs in Spark Streaming

Expert Contributor

Hi,

Is there a way to determine how many jobs will eventually be created for a batch in Spark Streaming? Spark Streaming captures all the events that arrive within a window called the batch interval. In addition, there is a block interval, which divides the batch data into blocks.

Example: batch interval 5 seconds

Block Interval: 1 second

This means you will have 5 blocks in a batch. Each block is processed by a task. Therefore you will have a total of 5 tasks.
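To illustrate, a minimal configuration sketch in Scala for the numbers above (the application name is illustrative; spark.streaming.blockInterval defaults to 200 ms):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: 5-second batch interval with a 1-second block interval, so each
// batch of receiver data is split into roughly 5 blocks, i.e. roughly
// 5 partitions in the batch RDD.
val conf = new SparkConf()
  .setAppName("BlockIntervalExample")            // illustrative name
  .set("spark.streaming.blockInterval", "1s")    // default is 200ms
val ssc = new StreamingContext(conf, Seconds(5)) // batch interval = 5 seconds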

How can I find the number of Jobs that will be there in a batch?

In a Spark application you have:

Jobs, which consist of a number of sequential stages, and each stage consists of a number of tasks (as mentioned above).

1 ACCEPTED SOLUTION

avatar
Expert Contributor

The number of jobs depends on the number of actions of your application. Spark has two kinds of operations: actions and transformations. Every action triggers a job. The number of actions (and therefore the number of jobs) of course depends on your application logic and your implementation.

In a streaming application, all the actions are performed on every batch. Thus, in the simplest case, you will have one job per batch. This happens, for instance, if you read your data, transform it (with map, filter, or similar methods), and write it out (for instance, to HDFS).
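As a rough sketch of that simplest case (the source, host/port, and output path below are illustrative), a pipeline with a single output operation produces one job per batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("OneJobPerBatch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// Illustrative input source.
val lines = ssc.socketTextStream("localhost", 9999)

// Transformations alone do not trigger any job.
val longWords = lines.flatMap(_.split(" ")).filter(_.length > 3)

// A single output operation => one job per batch interval.
longWords.saveAsTextFiles("hdfs:///tmp/streaming-out")

ssc.start()
ssc.awaitTermination()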

When you say:

This means you will have 5 blocks in a batch. Each block is processed by a task. Therefore you will have a total of 5 tasks.

That's not quite right. What you describe is the simplest case, i.e. one job per batch and one stage per job. It would be more accurate to say that you have a total of 5 partitions. So the total number of tasks depends on the number of jobs, the number of stages, and the number of partitions. The number of partitions is the easiest one to determine; the number of jobs depends on the number of actions your code performs.
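If you want to check the number of partitions of each batch yourself, a sketch like this (assuming lines is an existing DStream) will print it; note that foreachRDD is itself an output operation, used here only for inspection:

// Assuming "lines" is an existing DStream; this only inspects metadata,
// it does not force a computation on the RDD.
lines.foreachRDD { rdd =>
  println(s"Partitions in this batch: ${rdd.getNumPartitions}")
}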


3 REPLIES


Expert Contributor

@Marco Gaido

Thanks for your answer. I also found the answer on slide 23 here: Deep dive with Spark Streaming.

I do agree that you can get the number of blocks, which represent the partitions.

total tasks per job = number of stages in the job * number of partitions
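For instance, taking the 5 partitions per batch from the example above and a job with 2 stages (say, one shuffle), that would give roughly 2 * 5 = 10 tasks for that job, assuming both stages keep 5 partitions.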

I was also wondering what happens when the data rate varies considerably: will we have uneven blocks, meaning that tasks will have an uneven workload?

I am a bit confused now. In plain Spark used for batch processing, the architecture has jobs, where each job has stages and stages have tasks. There, whenever you have an action a new stage is created; therefore, in such a case, the number of stages depends on the number of actions.

A stage contains all transformations until an action (or output) is performed.

In the case of Spark Streaming, we have one job per action. How many stages will there be if a separate job is created every time you perform an action?

Thanks

Expert Contributor

There is no difference between Spark and Spark Streaming in terms of job, stage, and task management. In both cases you have one job per action. In both cases, as you correctly stated, jobs are made of stages, and stages are made of tasks. You have one task per partition of your RDD. The number of stages depends on the number of wide dependencies (shuffles) you encounter in the lineage to perform a given action. The only difference is that in Spark Streaming everything is repeated for each (mini-)batch.
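As a rough illustration (the source, host/port, and names below are just examples): in the sketch, reduceByKey introduces a shuffle (a wide dependency), so the single output operation produces one job per batch with two stages, and each stage has one task per partition:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StagesFromShuffles").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)   // illustrative source

val counts = lines.flatMap(_.split(" "))   // narrow dependency: same stage
  .map(word => (word, 1))                  // narrow dependency: same stage
  .reduceByKey(_ + _)                      // wide dependency: shuffle => new stage

// One output operation => one job per batch, two stages per job here.
counts.print()

ssc.start()
ssc.awaitTermination()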