Created on 08-30-201803:42 PM - edited 08-17-201906:34 AM
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.
The Hadoop YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of many data access engines that work with YARN in HDP.
1. Spark
is a library and not a service. 2. Spark interacts with multiple services
like HDFS and YARN to process data. 3. Spark client also has YARN client wrapped
within it. 4. Spark can be configured to run both
locally and on the cluster 5. Spark context is the entry point to
interact with a Spark process. 6. Spark is a JVM based execution engine.
Take Away 1. Each line of your code is parsed to prepare a spark Plan. 2. sc.TextFile => results in fetching the metaInfo from Name Node of where are the file Blocks are located and requesting YARN for containers on those host. Text File also provides the information about the record delimeter used ( new line character in case of Text). 3. The transformations are all grouped together in a Task. The transformation are serialized on driver and send to the executors. Do appreciate all transformation and object creation happens on Driver and subsequently sent to executors. 4. Reduce by results in Data Shuffling know as Stages in Spark. 5. saveAsTextFile interacts with Name Node to get information about where to save the file and saves the file on HDFS.