This guide outlines the steps for installing Apache Spark 3.x on your Cloudera cluster by leveraging Cloudera Manager (CM) and CDS3 parcels. Learn how to efficiently download, distribute, and activate the CDS3 parcel for a seamless Spark3 deployment, saving you time and effort compared to traditional methods. Additionally, the guide provides resources for troubleshooting any potential issues encountered during the installation.
Note: This article mainly focuses on the case where the Cloudera Manager Server has Internet access.
CDS 3 Powered by Apache Spark is an add-on service for CDP Private Cloud Base, distributed as a parcel. The Custom Service Descriptor (CSD) files are bundled with Cloudera Manager for CDP 7.1.8 and above; for CDP 7.1.7 they must be installed manually (see the table below).
The CDS version label is constructed in v.v.v.w.w.xxxx.y-z...z format and carries the following information:
v.v.v - Apache Spark upstream version, for example, 3.3.2
w.w - CDS version, for example, 3.3
xxxx - CDP version the parcel is built for, for example, 7190 for CDP 7.1.9
y - maintenance (hotfix) number, for example, 0
z...z - build number, for example, 91
The Spark3 base parcel location is https://archive.cloudera.com/p/spark3.
CDP Version | CDS3 Version | Spark Version | Parcel Repository | CSD Installation Required?
7.1.9 | CDS 3.3 | 3.3.2.3.3.7190.0-91 | https://archive.cloudera.com/p/spark3/3.3.7190.0/parcels/ | No
7.1.8 | CDS 3.3 | 3.3.0.3.3.7180.0-274 | https://archive.cloudera.com/p/spark3/3.3.7180.0/parcels/ | No
7.1.7 SP2 | CDS 3.2.3 | 3.2.3.3.2.7172000.0-334 | https://archive.cloudera.com/p/spark3/3.2.7172000.0/parcels/ | Yes
7.1.7 SP1 | CDS 3.2.3 | 3.2.1.3.2.7171000.0-3 | https://archive.cloudera.com/p/spark3/3.2.7171000.0/parcels/ | Yes
Note: Always install the latest parcel version available, because parcel versions are updated frequently.
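If you want to confirm that the parcel repository is reachable from the Cloudera Manager server before adding it, you can issue a quick HTTP check (a minimal sketch; substitute your Cloudera paywall credentials, and note that the CDP 7.1.9 repository URL from the table above is used here only for illustration):
# Request the repository headers; an HTTP 200 status indicates the repository is reachable
curl -sI https://<username>:<password>@archive.cloudera.com/p/spark3/3.3.7190.0/parcels/ | head -n 1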
A CDP Private Cloud Base cluster running version 7.1.7 or above
Prepare your Cloudera Manager server and cluster nodes with internet access for downloading necessary dependencies.
Install the Java and Python versions required by your Spark version.
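As a quick sanity check (assuming java and python3 are already on the PATH of each node), you can confirm the installed versions before proceeding:
# Print the Java version available on this node
java -version
# Print the Python 3 version available on this node
python3 --version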
The CDS 3 parcel consists of two components: Spark 3 and Livy for Spark 3.
Whether the CSDs must be installed manually depends on your CDP version (see the table above). For CDP 7.1.7, download and install them on the Cloudera Manager server as follows:
# On the Cloudera Manager server, change to the local CSD directory
cd /opt/cloudera/csd
# Download the Spark 3 and Livy CSD jars, substituting your Cloudera paywall
# credentials and the CSD versions that match your CDP release
wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/SPARK3_ON_YARN-<spark3_csd_version>.jar
wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/LIVY_FOR_SPARK3-<livy3_csd_version>.jar
# Make the jars owned by the cloudera-scm user and world-readable
chown cloudera-scm:cloudera-scm *
chmod 644 *
After changing the ownership and permissions, listing the directory (ls -l) shows output similar to the following:
-rw-r--r-- 1 cloudera-scm cloudera-scm 17216 Feb 10 2023 LIVY_FOR_SPARK3-0.6.3000.3.2.7172000.0-334.jar
-rw-r--r-- 1 cloudera-scm cloudera-scm 20227 Feb 10 2023 SPARK3_ON_YARN-3.2.3.3.2.7172000.0-334.jar
# Restart the Cloudera Manager Server so it picks up the newly installed CSDs
systemctl restart cloudera-scm-server
Navigate to Clusters -> [Your Cluster Name], for example, Cluster 1.
Locate the Spark 3 service in the list of services.
Verify that the Spark 3 service is started and healthy.
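Optionally, confirm the deployed Spark 3 version from any host with the Spark 3 gateway role (the spark3-submit launcher is provided by the CDS parcel):
spark3-submit --version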
Navigate to Clusters -> [Your Cluster Name], for example, Cluster 1.
Locate the Livy for Spark 3 service in the list of services.
Verify that the Livy for Spark 3 service is started and healthy.
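As an optional check, you can query the Livy for Spark 3 REST API for active sessions. The host and port below are placeholders; verify the actual Livy for Spark 3 server port in your service configuration:
# List Livy sessions; an empty JSON session list confirms the server is responding
curl http://<livy_server_host>:<livy_port>/sessions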
You can use the following sample Spark Pi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.
spark3-submit \
--master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10
You will see output similar to the following in the console.
Pi is roughly 3.142279142279142
spark3-submit \
--master yarn \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10
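In cluster mode, the driver runs inside the YARN ApplicationMaster, so the Pi result is written to the YARN application logs rather than to your local console. You can retrieve it with the yarn CLI, using the application ID printed by spark3-submit:
# Fetch the aggregated application logs and filter for the result line
yarn logs -applicationId <application_id> | grep "Pi is roughly"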
Python is required to run any PySpark application. Before launching a PySpark application, ensure Python is installed and configured in the Spark environment on each node where the application executes.
While some operating systems ship with Python pre-installed, others do not. It is crucial to verify that a Spark-supported Python version is installed at a consistent location on every node in your cluster.
The following step can be skipped if you have already installed a Spark-supported Python version that is compatible with your operating system.
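For example, on RHEL-family systems, Python 3 can be installed from the OS repositories (a minimal sketch; package names and the Python version your Spark release supports may differ on your distribution):
# Install Python 3 on every cluster node
sudo yum install -y python3
# Verify the interpreter path and version are consistent across nodes
which python3
python3 --version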
Specify the Python binary to be used by the Spark driver and executors by setting the PYSPARK_PYTHON environment variable in spark-env.sh. You can also override the driver's Python binary path individually by using the PYSPARK_DRIVER_PYTHON environment variable. These settings apply whether you use YARN client or cluster mode.
Make sure to set the variables using the export statement. For example:
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-<path_to_python_executable>}
If you are using YARN cluster mode, in addition to the above, set spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON in spark-defaults.conf (using the safety valve in Cloudera Manager) to the same paths.
The following steps assume you have installed a Python version compatible with your Spark installation.
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/python3}
export PYSPARK_DRIVER_PYTHON=${PYSPARK_DRIVER_PYTHON:-/usr/bin/python3}
NOTE: Use your python3 location, for example /usr/bin/python3.
spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3
spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/usr/bin/python3
You can use the following sample PySpark SparkPi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.
spark3-submit \
--master yarn \
--deploy-mode client \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py
You will see output similar to the following in the console.
Pi is roughly 3.132920
spark3-submit \
--master yarn \
--deploy-mode cluster \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py
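As a final check that the configured Python is being picked up, you can start an interactive shell (assuming the pyspark3 launcher from the CDS parcel is on your PATH) and inspect the driver's interpreter:
pyspark3 --master yarn
Inside the shell, running import sys; print(sys.executable) prints the Python binary in use by the driver, which should match the PYSPARK_DRIVER_PYTHON path you configured.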