
What is Directed Acyclic Graph in Apache Spark?


Re: What is Directed Acyclic Graph in Apache Spark?

@Shreya Gupta

See if the following helps you understand:

Spark features an advanced Directed Acyclic Graph (DAG) engine supporting acyclic data flow. Each Spark job creates a DAG of task stages to be performed on the cluster. MapReduce, by comparison, always creates a DAG with two predefined stages, Map and Reduce, whereas the DAGs created by Spark can contain any number of stages. This allows some jobs to complete faster than they would in MapReduce: simple jobs can complete after just one stage, and more complex jobs can complete in a single run of many stages rather than having to be split into multiple jobs.
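To make "any number of stages" a bit more concrete, here is a rough pure-Python sketch. It is not Spark code. It assumes the usual rule that Spark starts a new stage wherever an operation needs a shuffle, and it simplifies by placing the shuffle operation at the start of the next stage. The function name split_into_stages and the SHUFFLE_OPS set are made up for illustration; only the operation names inside the lists are real Spark API names.

```python
# Operations that typically require a shuffle (illustrative, not exhaustive).
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "sortByKey", "distinct"}


def split_into_stages(ops):
    """Group a linear chain of operation names into stages.

    Simplified model: a new stage begins at every shuffle operation.
    """
    stages = [[]]
    for op in ops:
        # Close the current stage when a shuffle is needed (unless it is empty).
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


# A simple job with no shuffle: everything fits in one stage.
simple_stages = split_into_stages(["textFile", "filter", "map"])

# A word-count-and-sort style job: two shuffles, so three stages in one job.
word_count_stages = split_into_stages(
    ["textFile", "flatMap", "map", "reduceByKey", "map", "sortByKey"]
)

print(len(simple_stages), len(word_count_stages))
```

In classic MapReduce the second job above would normally have to be written as more than one Map/Reduce job chained together, which is the difference the paragraph above describes.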

Spark jobs perform work on Resilient Distributed Datasets (RDDs), an abstraction for a collection of elements that can be operated on in parallel. When running Spark in a Hadoop cluster, RDDs are created from files in the distributed file system in any format supported by Hadoop, such as text files, SequenceFiles, or anything else supported by a Hadoop InputFormat.

Once data is read into an RDD object in Spark, a variety of operations can be performed by calling abstract Spark APIs. The two major types of operation available are:

  • Transformations: Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union().
  • Actions: Actions return a value based on some computation being performed on an RDD. Some examples of actions supported by the Spark API include reduce(), count(), first(), and foreach().
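The difference between the two kinds of operation can be sketched with a toy, single-machine model in plain Python. This is only an illustration: ToyRDD and its lineage attribute are hypothetical names, not Spark's API, although the method names mirror the real map(), filter(), reduce(), count(), and first(). The point it shows is that transformations only record a step and return a new object, while actions actually run the recorded steps and return a value.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source, lineage=("source",)):
        self._source = source    # zero-argument callable yielding the elements
        self.lineage = lineage   # names of the steps recorded so far

    # Transformations: return a new ToyRDD; nothing is computed yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._source()),
                      self.lineage + ("map",))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)),
                      self.lineage + ("filter",))

    # Actions: run the recorded steps and return a plain value.
    def count(self):
        return sum(1 for _ in self._source())

    def reduce(self, f):
        return fold(f, self._source())

    def first(self):
        return next(iter(self._source()))


calls = []  # records every element that square() is actually applied to


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(lambda: iter(range(1, 6)))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = list(calls)  # still empty: no action has run yet
total = even_squares.reduce(lambda a, b: a + b)  # action triggers the work
calls_after_action = list(calls)

print(even_squares.lineage, total)
```

In real PySpark the same chain would look roughly like sc.parallelize(range(1, 6)).map(square).filter(...).reduce(...), where sc is an existing SparkContext.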

Some Spark jobs will require that several actions or transformations be performed on a particular data set, making it highly desirable to hold RDDs in memory for rapid access. Spark exposes a simple API to do this: cache(). Calling cache() marks the RDD to be kept in memory; the data is actually stored the first time an action computes it. After that, future operations on the RDD read the in-memory copy instead of recomputing it from the source, and typically return in a fraction of the time they would if the data were retrieved from disk.
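The effect of cache() can be illustrated with another small pure-Python model. Again, this is a sketch under assumptions rather than Spark code: ToyDataset and compute_runs are invented names, and the model keeps everything in a local list. It mimics two behaviours of the real API: cache() by itself computes nothing, and once the data has been materialized by an action, later actions reuse it.

```python
class ToyDataset:
    """Toy model of an RDD with optional caching (hypothetical; not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument callable producing the elements
        self._cache_requested = False
        self._cached = None
        self.compute_runs = 0        # how many times the source was recomputed

    def cache(self):
        # Only marks the dataset; nothing is computed here.
        self._cache_requested = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached      # reuse the in-memory copy
        self.compute_runs += 1
        data = list(self._compute())
        if self._cache_requested:
            self._cached = data
        return data

    # Actions
    def count(self):
        return len(self._materialize())

    def first(self):
        return self._materialize()[0]


def expensive():
    # Stands in for reading and transforming data from disk.
    return (x * x for x in range(1000))


uncached = ToyDataset(expensive)
uncached.count()
uncached.first()
uncached_runs = uncached.compute_runs   # recomputed for every action

cached = ToyDataset(expensive).cache()
runs_right_after_cache = cached.compute_runs  # cache() alone did no work
cached.count()
cached.first()
cached_runs = cached.compute_runs       # computed once, then reused

print(uncached_runs, runs_right_after_cache, cached_runs)
```

In PySpark the corresponding call is simply rdd.cache() on an existing RDD, followed by the actions you want to run against it.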
