This guide outlines the steps for installing Apache Spark 3.x on your Cloudera cluster by leveraging Cloudera Manager (CM) and CDS3 parcels. Learn how to efficiently download, distribute, and activate the CDS3 parcel for a seamless Spark3 deployment, saving you time and effort compared to traditional methods. Additionally, the guide provides resources for troubleshooting any potential issues encountered during the installation.
Note: This article mainly focuses on the case where the Cloudera Manager Server has Internet access.
CDS 3 Powered by Apache Spark is an add-on service for CDP Private Cloud Base, distributed as a parcel. The Custom Service Descriptor (CSD) files are bundled with Cloudera Manager for CDP 7.1.8 and above; for CDP 7.1.7 they must be installed manually (see the table below).
The CDS version label is constructed in v.v.v.w.w.xxxx.y-z...z format and carries the following information:
v.v.v - Apache Spark upstream version, for example, 3.3.2
w.w - CDS version, for example, 3.3
xxxx - CDP version the parcel is built for, for example, 7190 for CDP 7.1.9
y - maintenance (hotfix) number, for example, 0
z...z - build number, for example, 91
The Spark3 base parcel location is https://archive.cloudera.com/p/spark3.
CDP Version | CDS3 Version | Spark Version | Parcel Repository | CSD Installation Required?
7.1.9 | CDS 3.3 | 3.3.2.3.3.7190.0-91 | https://archive.cloudera.com/p/spark3/3.3.7190.0/parcels/ | No
7.1.8 | CDS 3.3 | 3.3.0.3.3.7180.0-274 | https://archive.cloudera.com/p/spark3/3.3.7180.0/parcels/ | No
7.1.7 SP2 | CDS 3.2.3 | 3.2.3.3.2.7172000.0-334 | https://archive.cloudera.com/p/spark3/3.2.7172000.0/parcels/ | Yes
7.1.7 SP1 | CDS 3.2.3 | 3.2.1.3.2.7171000.0-3 | https://archive.cloudera.com/p/spark3/3.2.7171000.0/parcels/ | Yes
Note: Always install the latest parcel version available, because parcel versions are updated frequently.
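If you want to confirm that the parcel repository is reachable from the Cloudera Manager server before adding it, you can issue a quick HTTP check (a minimal sketch; substitute your Cloudera paywall credentials, and note that the CDP 7.1.9 repository URL from the table above is used here only for illustration):
# Request the repository headers; an HTTP 200 status indicates the repository is reachable
curl -sI https://<username>:<password>@archive.cloudera.com/p/spark3/3.3.7190.0/parcels/ | head -n 1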
A CDP Private Cloud Base cluster running version 7.1.7 or above
Prepare your Cloudera Manager server and cluster nodes with internet access for downloading necessary dependencies.
Install the Java and Python versions required by your Spark version.
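As a quick sanity check (assuming java and python3 are already on the PATH of each node), you can confirm the installed versions before proceeding:
# Print the Java version available on this node
java -version
# Print the Python 3 version available on this node
python3 --version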
The CDS 3 parcel consists of two components: Spark 3 and Livy for Spark 3.
Whether the CSDs must be installed manually depends on your CDP version (see the table above). For CDP 7.1.7, download and install them on the Cloudera Manager server as follows:
# On the Cloudera Manager server, change to the local CSD directory
cd /opt/cloudera/csd
# Download the Spark 3 and Livy CSD jars, substituting your Cloudera paywall
# credentials and the CSD versions that match your CDP release
wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/SPARK3_ON_YARN-<spark3_csd_version>.jar
wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/LIVY_FOR_SPARK3-<livy3_csd_version>.jar
# Make the jars owned by the cloudera-scm user and world-readable
chown cloudera-scm:cloudera-scm *
chmod 644 *
After changing the ownership and permissions, listing the directory (ls -l) shows output similar to the following:
-rw-r--r-- 1 cloudera-scm cloudera-scm 17216 Feb 10 2023 LIVY_FOR_SPARK3-0.6.3000.3.2.7172000.0-334.jar
-rw-r--r-- 1 cloudera-scm cloudera-scm 20227 Feb 10 2023 SPARK3_ON_YARN-3.2.3.3.2.7172000.0-334.jar
# Restart the Cloudera Manager Server so it picks up the newly installed CSDs
systemctl restart cloudera-scm-server
Navigate to Clusters -> [Your Cluster Name], for example, Cluster 1.
Locate the Spark 3 service in the list of services.
Verify that the Spark 3 service is started and healthy.
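Optionally, confirm the deployed Spark 3 version from any host with the Spark 3 gateway role (the spark3-submit launcher is provided by the CDS parcel):
spark3-submit --version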
Navigate to Clusters -> [Your Cluster Name], for example, Cluster 1.
Locate the Livy for Spark 3 service in the list of services.
Verify that the Livy for Spark 3 service is started and healthy.
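As an optional check, you can query the Livy for Spark 3 REST API for active sessions. The host and port below are placeholders; verify the actual Livy for Spark 3 server port in your service configuration:
# List Livy sessions; an empty JSON session list confirms the server is responding
curl http://<livy_server_host>:<livy_port>/sessions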
You can use the following sample Spark Pi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.
spark3-submit \
--master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10
You will see output similar to the following in the console.
Pi is roughly 3.142279142279142
spark3-submit \
--master yarn \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10
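In cluster mode, the driver runs inside the YARN ApplicationMaster, so the Pi result is written to the YARN application logs rather than to your local console. You can retrieve it with the yarn CLI, using the application ID printed by spark3-submit:
# Fetch the aggregated application logs and filter for the result line
yarn logs -applicationId <application_id> | grep "Pi is roughly"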
Python is required to run any PySpark application. Before launching a PySpark application, ensure Python is installed and configured in the Spark environment on each node where the application executes.
While some operating systems ship with Python pre-installed, others do not. It is crucial to verify that a Spark-supported Python version is installed at a consistent location on every node in your cluster.
The following step can be skipped if you have already installed a Spark-supported Python version that is compatible with your operating system.
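For example, on RHEL-family systems, Python 3 can be installed from the OS repositories (a minimal sketch; package names and the Python version your Spark release supports may differ on your distribution):
# Install Python 3 on every cluster node
sudo yum install -y python3
# Verify the interpreter path and version are consistent across nodes
which python3
python3 --version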
Specify the Python binary to be used by the Spark driver and executors by setting the PYSPARK_PYTHON environment variable in spark-env.sh. You can also override the driver's Python binary path individually by using the PYSPARK_DRIVER_PYTHON environment variable. These settings apply whether you use YARN client or cluster mode.
Make sure to set the variables using the export statement. For example:
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-<path_to_python_executable>}
If you are using YARN cluster mode, in addition to the above, set spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON in spark-defaults.conf (using the safety valve in Cloudera Manager) to the same paths.
The following steps assume you have installed a Python version compatible with your Spark installation.
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/python3}
export PYSPARK_DRIVER_PYTHON=${PYSPARK_DRIVER_PYTHON:-/usr/bin/python3}
NOTE: Use your python3 location, for example /usr/bin/python3.
spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3
spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/usr/bin/python3
You can use the following sample PySpark SparkPi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.
spark3-submit \
--master yarn \
--deploy-mode client \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py
You will see output similar to the following in the console.
Pi is roughly 3.132920
spark3-submit \
--master yarn \
--deploy-mode cluster \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py
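As a final check that the configured Python is being picked up, you can start an interactive shell (assuming the pyspark3 launcher from the CDS parcel is on your PATH) and inspect the driver's interpreter:
pyspark3 --master yarn
Inside the shell, running import sys; print(sys.executable) prints the Python binary in use by the driver, which should match the PYSPARK_DRIVER_PYTHON path you configured.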