spark-connect-slide.png

JupyterLab is a powerful, web-based interactive development environment widely used for data science, machine learning, and Spark application development. It extends the classic Jupyter Notebook interface, offering a more flexible and integrated workspace.

For Spark application development, JupyterLab provides the following advantages:

1. Interactive Exploration: Run Spark queries and visualize results interactively, making it ideal for data exploration and debugging.

2. Rich Visualization: Seamlessly integrate with Python visualization libraries like Matplotlib, Seaborn, and Plotly to analyze and interpret Spark data.

3. Ease of Integration: Use PySpark or Sparkmagic to connect JupyterLab with Spark clusters, enabling distributed data processing from the notebook environment.

Spark Connect is a feature introduced in Apache Spark that provides a standardized, client-server architecture for connecting to Spark clusters. It decouples the client from the Spark runtime, allowing users to interact with Spark through lightweight, language-specific clients without the need to run a full Spark environment on the client side.

With Spark Connect, users can:

1. Access Spark Remotely: Connect to a Spark cluster from various environments, including local machines or web applications.

2. Support Multiple Languages: Use Spark with Python, Scala, Java, SQL, and other languages through dedicated APIs.

3. Simplify Development: Develop and test Spark applications without needing a full Spark installation, making it easier for developers and data scientists to work with distributed data processing.

This architecture enhances usability, scalability, and flexibility, making Spark more accessible to a wider range of users and environments.
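
For illustration, here is a minimal sketch of a Spark Connect connection using the stock PySpark client; the endpoint below is a placeholder, not a CDE-specific value. Later in this article you will use Cloudera's cdeconnect package, which wraps this connection setup for CDE Sessions.

from pyspark.sql import SparkSession

# Connect to a remote Spark cluster over Spark Connect.
# "sc://localhost:15002" is a placeholder endpoint; substitute your
# cluster's Spark Connect host and port.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The DataFrame API behaves as usual; execution happens on the cluster.
spark.range(5).show()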

In this article, you will use JupyterLab locally to interactively prototype a PySpark and Iceberg application in a dedicated Spark Virtual Cluster running in Cloudera Data Engineering (CDE) on AWS.

Prerequisites
  • A CDE Service on version 1.23 or above, with a Virtual Cluster running Spark 3.5.1.
  • A local installation of the CDE CLI on version 1.23 or above.
  • A local installation of JupyterLab. Version 4.0.7 was used for this demonstration, but other versions should work as well.
  • A local installation of Python. Version 3.9.12 was used for this demonstration, but other versions should work as well.
1. Launch a CDE Spark Connect Session

Create a CDE Files Resource and upload the CSV files.


cde resource create \
  --name telcoFiles \
  --type files

cde resource upload \
  --name telcoFiles \
  --local-path resources/cell_towers_1.csv \
  --local-path resources/cell_towers_2.csv


Start a CDE Session of type Spark Connect. Edit the --name parameter so it doesn't collide with other users' sessions.


cde session create \
  --name spark-connect-session-res \
  --type spark-connect \
  --num-executors 2 \
  --driver-cores 2 \
  --driver-memory "2g" \
  --executor-cores 2 \
  --executor-memory "2g" \
  --mount-1-resource telcoFiles


In the Sessions UI, validate that the Session is Running.

cde_session_validate_1.png

cde_session_validate_2.png

2. Install Spark Connect Prerequisites

From the terminal, set up the following Spark Connect prerequisites:

  • Download the cdeconnect and PySpark packages from the CDE Session Configuration tab and place them in your project home folder:

cde_spark_connect_download_deps.png

cde_spark_connect_project_home.png

  • Create a new Python Virtual Environment:


python -m venv spark_connect_jupyter
source spark_connect_jupyter/bin/activate


  • Install the following packages. These exact versions were used with Python 3.9; the numpy, cmake, and pyarrow versions may need to change depending on your Python version.


pip install numpy==1.26.4
pip install --upgrade cmake
pip install pyarrow==14.0.0
pip install cdeconnect.tar.gz  
pip install pyspark-3.5.1.tar.gz
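
Optionally, run a quick sanity check (not part of the original steps) to confirm the client-side packages import cleanly before launching JupyterLab:

# sanity_check.py -- confirm the Spark Connect client dependencies resolve
import numpy
import pyarrow
import pyspark

print("numpy:", numpy.__version__)
print("pyarrow:", pyarrow.__version__)
print("pyspark:", pyspark.__version__)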


  • Install and launch the JupyterLab server with:


pip install jupyterlab
jupyter lab


launch_cde_spark_connect_jupyter.png

3. Run Your First PySpark & Iceberg Application via Spark Connect

You are now ready to connect to the CDE Session from your local JupyterLab instance using Spark Connect.

  • In the first cell, edit the sessionName option, setting it to the session name you chose in the cde session create command above.
  • In the second cell, edit your username. A sketch of both cells follows below.
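
As a rough sketch, those two cells might look like the following. The CDESparkConnectSession builder comes from the cdeconnect package you installed earlier; the exact import path and builder API may vary by CDE version, and the username value is illustrative:

# Cell 1: connect to the running CDE Spark Connect Session.
# The session name must match the --name used in 'cde session create'.
from cde import CDESparkConnectSession
spark = CDESparkConnectSession.builder.sessionName("spark-connect-session-res").get()

# Cell 2: set your username; it keeps table names unique per user.
username = "user001"  # edit this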

Now run each cell and observe the outputs.
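
From there, the remaining cells follow the usual PySpark patterns. For instance, a minimal Iceberg round trip over the uploaded CSV data might look like this; the /app/mount path (where CDE mounts Files Resources) and the table name are assumptions for illustration:

# Read one of the uploaded CSV files from the mounted Files Resource.
df = spark.read.csv("/app/mount/cell_towers_1.csv", header=True, inferSchema=True)

# Write it out as an Iceberg table, then read it back with Spark SQL.
df.writeTo(f"default.cell_towers_{username}").using("iceberg").createOrReplace()
spark.sql(f"SELECT COUNT(*) FROM default.cell_towers_{username}").show()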

cde_spark_connect_notebook_outputs.png

