Apache Spark is an open-source, general-purpose distributed cluster computing framework with an in-memory data processing engine. We can perform ETL, analytics, machine learning, and graph processing on large volumes of data at rest as well as data in motion, using rich, concise, high-level APIs in Python, Java, Scala, R, and SQL. The Spark execution engine can process both batch data and real-time streaming data from different data sources via Spark Streaming. Spark provides interactive shells for data exploration and ad hoc analysis. We can run Spark applications locally or distributed across a cluster, either by using one of the interactive shells or by submitting an application.
spark-shell - Scala interactive shell
pyspark - Python interactive shell
sparkR - R interactive shell
To run applications distributed across a cluster, Spark requires a cluster manager. A Spark environment can be deployed and managed through the following cluster managers.
Spark standalone is the simplest way to deploy Spark on a private cluster. Here, Spark application processes are managed by the Spark Master and Worker nodes.
When Spark applications run on a YARN cluster manager, Spark application processes are managed by the YARN ResourceManager and NodeManager. YARN controls resource management, scheduling and security of submitted applications.
Besides YARN, a Spark cluster can be managed by Mesos (a general cluster manager that can also run Hadoop MapReduce and service applications) and Kubernetes (an open-source system for automating deployment, scaling, and management of containerized applications). Kubernetes support was experimental when it was introduced in Spark 2.3 and has been generally available since Spark 3.1.
How does Spark work in cluster mode?
Spark orchestrates its operations through the driver program. The driver program is the process that runs the main() function of the application and creates the SparkContext. When the driver program runs, the Spark framework initializes executor processes on the cluster hosts that process our data. The driver program must be network addressable from the worker nodes. An executor is a process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk across them. Each application has its own executors. Each driver and executor process in a cluster runs in its own JVM.
The following occurs when we submit a Spark application to a cluster:
The driver is launched and invokes the main method in the Spark application.
The driver requests resources from the cluster manager to launch executors.
The cluster manager launches executors on behalf of the driver program.
The driver runs the application. Based on the transformations and actions in the application, the driver sends tasks to the executors.
Tasks are run on executors to compute and save results.
If dynamic allocation is enabled, after executors are idle for a specified period, they are released.
When the driver's main method exits or calls SparkContext.stop, the driver terminates any outstanding executors and releases resources from the cluster manager.
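The submission flow above can be modeled, very loosely, in plain Python. The sketch below is a toy and does not use Spark at all: a thread pool stands in for the executors, the names run_task and driver_main are hypothetical, and each "task" simply sums one partition of the input.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """A task: the unit of work the driver sends to an executor."""
    return sum(partition)


def driver_main(data, num_executors=2, num_partitions=4):
    """Toy driver: split the data, send one task per partition, combine results."""
    # Split the input into partitions (a stand-in for the partitions of a dataset).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]

    # "Launching executors": the pool plays the role of the executor processes.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(run_task, partitions))
    # Leaving the with-block shuts the pool down, loosely like SparkContext.stop.

    return sum(partial_results)


if __name__ == "__main__":
    print(driver_main(list(range(1, 101))))  # sum of 1..100, prints 5050
```

In real Spark the executors are separate JVM processes on other hosts and the cluster manager decides where they run; the sketch only mirrors the order of the steps.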
Let us focus on running Spark applications on YARN.
Advantages of Spark on YARN over other cluster managers:
We can dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.
We can use all the features of the YARN schedulers (Capacity Scheduler, Fair Scheduler, or FIFO Scheduler) for categorizing, isolating, and prioritizing workloads.
We can choose the number of executors to use; in contrast, Spark Standalone by default runs an executor for each application on every worker host in the cluster.
Spark can run against Kerberos-enabled Hadoop clusters and use secure authentication between its processes.
Deployment Modes in YARN:
In YARN, each application instance has an ApplicationMaster process, which is the first container started for that application. The ApplicationMaster is responsible for requesting resources from the ResourceManager. Once the resources are allocated, the ApplicationMaster instructs NodeManagers to start containers on its behalf. ApplicationMasters eliminate the need for an active client: the process starting the application can terminate, and coordination continues from a process managed by YARN running on the cluster.
Cluster Deployment Mode:
In cluster mode, the Spark driver runs in the ApplicationMaster on a cluster host. A single process in a YARN container is responsible for both driving the application and requesting resources from YARN. The client that launches the application does not need to run for the lifetime of the application.
Cluster mode is not well suited to using Spark interactively. Spark applications that require user input, such as spark-shell and pyspark, require the Spark driver to run inside the client process that initiates the Spark application.
Client Deployment Mode:
In client mode, the Spark driver runs on the host where the job is submitted. The ApplicationMaster is responsible only for requesting executor containers from YARN. After the containers start, the client communicates with the containers to schedule work.
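The deployment mode is chosen at submission time with the --deploy-mode option of spark-submit. As a rough illustration, the helper below only assembles the command line; it does not run anything. The function name build_spark_submit and the application file wordcount.py are hypothetical, while --master, --deploy-mode, --num-executors, and --conf are standard spark-submit options.

```python
import shlex


def build_spark_submit(app, deploy_mode="cluster", num_executors=None, conf=None, app_args=()):
    """Return the argument list for submitting `app` to YARN (hypothetical helper)."""
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")

    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]
    # Extra Spark properties are passed as repeated --conf key=value options.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    cmd += list(app_args)
    return cmd


if __name__ == "__main__":
    # Cluster mode: the driver runs in the ApplicationMaster, so the client may exit.
    print(shlex.join(build_spark_submit("wordcount.py", "cluster", num_executors=4)))
    # Client mode: the driver stays on the submitting host.
    print(shlex.join(build_spark_submit("wordcount.py", "client")))
```

The returned list could be handed to subprocess.run on a host where spark-submit is installed and configured for the cluster.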
Why are Java and Scala preferred for Spark application development?
Accessing Spark with Java and Scala offers many advantages: platform independence by running inside the JVM, self-contained packaging of code and its dependencies into JAR files, and higher performance because Spark itself runs in the JVM. You lose these advantages when using the Spark Python API.
Managing dependencies and making them available for Python jobs on a cluster can be difficult.
To determine which dependencies are required on the cluster, we must understand that Spark application code runs in Spark executor processes distributed throughout the cluster.
If the Python transformations we define use any third-party libraries, such as NumPy or nltk, the Spark executors need access to those libraries on the remote hosts where they run. One practical approach is to expose a shared Python library path on the worker nodes over NFS so that every executor can load the third-party libraries.
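Before relying on a shared library path, it can help to verify which modules a given Python interpreter can actually import. The function below is a minimal sketch using only the standard library; the name missing_modules is hypothetical, and it handles top-level module names only.

```python
import importlib.util


def missing_modules(required):
    """Return the top-level module names in `required` that this interpreter cannot import."""
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # NumPy and nltk are the examples from the text; json ships with Python.
    print(missing_modules(["json", "numpy", "nltk"]))
```

Run on the driver, this only describes the driver's environment. To check the executors, a function like this would have to be called from inside a task (for example within a function passed to mapPartitions) so that it executes in the executors' Python environment.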
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. We can apply Spark’s machine learning and graph processing algorithms on data streams. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
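The micro-batch idea described above can be sketched without Spark. The toy below assumes events arrive as (timestamp in seconds, text line) pairs, groups them into fixed-interval batches, and runs a word count on each batch. The names micro_batches and word_count are hypothetical, and unlike a real streaming engine the sketch emits nothing for intervals that contain no events.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp, line) events into consecutive batches of `batch_interval` seconds."""
    batches = {}
    for timestamp, line in events:
        batch_index = int(timestamp // batch_interval)
        batches.setdefault(batch_index, []).append(line)
    # Emit the non-empty batches in time order.
    return [batches[i] for i in sorted(batches)]


def word_count(batch):
    """The per-batch computation: count the words in every line of the batch."""
    counts = Counter()
    for line in batch:
        counts.update(line.split())
    return dict(counts)


if __name__ == "__main__":
    events = [(0.2, "spark streaming"), (0.9, "spark"), (1.4, "kafka spark")]
    for batch in micro_batches(events, batch_interval=1.0):
        print(word_count(batch))
```

Each printed dictionary corresponds to one batch of results, which mirrors how the stream of results is itself produced in batches.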
Spark SQL is a Spark module for structured data processing. Spark SQL supports executing SQL queries on a variety of data sources, such as relational databases and Hive, through a DataFrame interface. A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. When computing a result, the same execution engine is used, independent of which API or language you use to express the computation. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.
Machine Learning Library (MLlib):
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. MLlib provides tools and utilities for applying common ML algorithms, such as classification, regression, clustering, and collaborative filtering, to large datasets. It supports operations such as feature extraction, transformation, dimensionality reduction, and selection. It has pipelining tools for constructing, evaluating, and tuning ML pipelines. We can save and load algorithms, models, and pipelines. MLlib also has utilities for linear algebra, statistics, data handling, etc.
GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
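To give a feel for the aggregateMessages operator, here is a toy model of the pattern in plain Python. It is not the GraphX API, and the function aggregate_messages and its signature are assumptions made for illustration. Each edge sends messages to vertices, and messages addressed to the same vertex are merged.

```python
def aggregate_messages(edges, send_msg, merge_msg):
    """Toy version of the aggregateMessages pattern over a list of (src, dst) edges.

    send_msg(src, dst) returns a list of (vertex_id, message) pairs;
    merge_msg(a, b) combines two messages addressed to the same vertex.
    """
    inbox = {}
    for src, dst in edges:
        for vertex, message in send_msg(src, dst):
            if vertex in inbox:
                inbox[vertex] = merge_msg(inbox[vertex], message)
            else:
                inbox[vertex] = message
    return inbox


if __name__ == "__main__":
    # A directed multigraph: the edge (1, 2) appears twice.
    edges = [(1, 2), (1, 2), (2, 3), (3, 1)]
    # In-degree: every edge sends the message 1 to its destination; messages are summed.
    in_degree = aggregate_messages(edges, lambda s, d: [(d, 1)], lambda a, b: a + b)
    print(in_degree)
```

Vertices that receive no message do not appear in the result, so a vertex with in-degree zero would simply be absent from the dictionary.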
Let us see how to develop a streaming application using the Apache Spark Streaming module.