Created 08-24-2017 05:02 AM
Check the following link; it may help you understand:
http://data-flair.training/blogs/dag-in-apache-spark/
Spark features an advanced Directed Acyclic Graph (DAG) execution engine that supports general, multi-stage data flows. Each Spark job creates a DAG of task stages to be performed on the cluster. Compared to MapReduce, which creates a DAG with two predefined stages (Map and Reduce), the DAGs created by Spark can contain any number of stages. This allows some jobs to complete faster than they would in MapReduce: simple jobs complete after just one stage, and more complex jobs complete in a single run of many stages rather than having to be split into multiple jobs.
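As a minimal sketch (the SparkContext sc, the input path, and the CSV layout are assumptions, not from the original post), the following pipeline runs as one Spark job whose DAG contains several stages; an equivalent MapReduce implementation would typically have to be split into chained jobs:

// Assumes an existing SparkContext `sc` (e.g. in spark-shell); path and field layout are hypothetical.
val events = sc.textFile("hdfs:///data/events.csv")

val topUsers = events
  .map(line => (line.split(",")(0), 1))    // narrow transformation, stays in the first stage
  .reduceByKey(_ + _)                       // shuffle boundary: Spark starts a new stage here
  .map { case (user, n) => (n, user) }      // narrow transformation in the next stage
  .sortByKey(ascending = false)             // another shuffle boundary: yet another stage

topUsers.take(10)                           // a single action triggers the whole multi-stage DAG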
Spark jobs perform work on Resilient Distributed Datasets (RDDs), an abstraction for a collection of elements that can be operated on in parallel. When running Spark in a Hadoop cluster, RDDs are created from files in the distributed file system in any format supported by Hadoop, such as text files, SequenceFiles, or anything else supported by a Hadoop InputFormat.
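For example (the paths below are hypothetical), each of the input formats mentioned above maps to a SparkContext call:

// Plain text files: one RDD element per line
val logs = sc.textFile("hdfs:///data/logs/*.log")

// SequenceFiles: supply the key and value Writable classes
import org.apache.hadoop.io.{IntWritable, Text}
val counts = sc.sequenceFile("hdfs:///data/counts.seq", classOf[Text], classOf[IntWritable])

// Any other Hadoop InputFormat, here the new-API TextInputFormat
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/raw")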
Once data is read into an RDD object in Spark, a variety of operations can be performed by calling the Spark API. The two major types of operation available are:
- Transformations, such as map() or filter(), which define a new RDD from an existing one and are evaluated lazily.
- Actions, such as count() or collect(), which trigger computation of the RDD's lineage and return a result to the driver (or write it to storage).
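A minimal sketch (the path and filter condition are made up) showing one of each:

val lines = sc.textFile("hdfs:///data/access.log")

// Transformation: lazily defines a new RDD, nothing runs on the cluster yet
val errors = lines.filter(_.contains("ERROR"))

// Action: triggers execution of the lineage and returns the result to the driver
val errorCount = errors.count()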
Some Spark jobs will require that several actions or transformations be performed on a particular data set, making it highly desirable to hold RDDs in memory for rapid access. Spark exposes a simple API for this: cache(). Once cache() is called on an RDD, subsequent operations on that RDD return in a fraction of the time they would take if the data had to be recomputed from disk.
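Continuing the hypothetical log example, caching pays off as soon as more than one action is run against the same RDD:

val errors = sc.textFile("hdfs:///data/access.log")
               .filter(_.contains("ERROR"))
               .cache()                              // ask Spark to keep this RDD in memory

errors.count()                                       // first action: computes the RDD and fills the cache
errors.filter(_.contains("timeout")).count()         // later actions reuse the cached partitions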
Created 04-24-2021 08:58 AM
Thank you, clear and solid explanation.