

What is the relationship between RDD lineage, DAG, DAG Scheduler, stages and tasks in Spark, and which one creates the others?


Hi Friends,

I am confused about how the RDD lineage, DAG, DAG Scheduler, stages and tasks are created.

Please validate my understanding:

1) After we submit a job and before an action is called, whatever transformations appear in the code are recorded on the RDD as its lineage history: which RDD is its parent, which transformations were applied to create it, and what its dependencies are. This is called the lineage (logical execution plan).

2) When an action is called on the RDD, the lineage is converted into a DAG (physical execution plan).

3) The DAG (physical execution plan) is submitted to the DAG Scheduler, which in turn splits the DAG into stages.

4) Each stage has a list of tasks.

5) Each task runs in an executor (does one executor run one task on one partition?)
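
To make the steps above concrete, here is a toy model in plain Python. It is only a sketch under simplifying assumptions, not Spark's actual implementation: `ToyRDD`, `split_into_stages` and `tasks_per_stage` are hypothetical names invented for illustration, the lineage is a single chain with one parent per RDD, and executors are not modelled at all. It shows transformations recording lineage, the lineage being cut into stages at each shuffle (wide dependency), and one task being planned per partition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRDD:
    """Hypothetical stand-in for an RDD: it only records how it was derived."""
    name: str
    num_partitions: int
    parent: Optional["ToyRDD"] = None
    wide: bool = False  # True if deriving it from the parent needs a shuffle

    def narrow(self, name):
        # Narrow dependency (like map or filter): partitioning is preserved.
        return ToyRDD(name, self.num_partitions, parent=self)

    def shuffle(self, name, num_partitions):
        # Wide dependency (like reduceByKey): data is redistributed.
        return ToyRDD(name, num_partitions, parent=self, wide=True)

    def lineage(self):
        # Walk the parent links back to the source; return oldest first.
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain[::-1]


def split_into_stages(final_rdd):
    """Cut the lineage at every wide dependency; each piece is one stage."""
    stages, current = [], []
    for rdd in final_rdd.lineage():
        if rdd.wide and current:
            stages.append(current)
            current = []
        current.append(rdd)
    stages.append(current)
    return stages


def tasks_per_stage(stages):
    # One task per partition of the last RDD in each stage.
    return [stage[-1].num_partitions for stage in stages]


# Transformations only record lineage; nothing is computed here.
counts = (ToyRDD("textFile", 4)
          .narrow("flatMap")
          .narrow("map")
          .shuffle("reduceByKey", 2))

# In this toy model, calling an action would trigger the planning below.
stages = split_into_stages(counts)
stage_names = [[rdd.name for rdd in stage] for stage in stages]
task_counts = tasks_per_stage(stages)
print(stage_names)   # [['textFile', 'flatMap', 'map'], ['reduceByKey']]
print(task_counts)   # [4, 2]
```

In Spark itself, the recorded lineage of a real RDD can be printed with `rdd.toDebugString`. On the question in step 5: an executor is not limited to a single task, since it can run several tasks at the same time depending on how many cores it has, while each task works on one partition.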

Also, I want to understand where the Catalyst optimizer and the Tungsten encoder come into the picture.

Is it the responsibility of the Catalyst optimizer to convert the RDD lineage into the best optimized execution plan as a DAG?

Is it the responsibility of the Tungsten encoder to convert the Scala code into bytecode?

Please help me understand the above.



Hi @bsuren123,

I think it's all about the JVM when it comes to converting Scala code into bytecode.
