JupyterLab is a powerful, web-based interactive development environment widely used for data science, machine learning, and Spark application development. It extends the classic Jupyter Notebook interface, offering a more flexible and integrated workspace.
For Spark application development, JupyterLab provides the following advantages:
1. Interactive Exploration: Run Spark queries and visualize results interactively, making it ideal for data exploration and debugging.
2. Rich Visualization: Seamlessly integrate with Python visualization libraries like Matplotlib, Seaborn, and Plotly to analyze and interpret Spark data.
3. Ease of Integration: Use PySpark or Sparkmagic to connect JupyterLab with Spark clusters, enabling distributed data processing from the notebook environment.
Spark Connect is a client-server architecture introduced in Apache Spark 3.4 that provides a standardized way to connect to Spark clusters. It decouples the client from the Spark runtime, allowing users to interact with Spark through lightweight, language-specific clients without running a full Spark environment on the client side.
With Spark Connect, users can:
1. Access Spark Remotely: Connect to a Spark cluster from various environments, including local machines or web applications.
2. Support Multiple Languages: Use Spark with Python, Scala, Java, SQL, and other languages through dedicated APIs.
3. Simplify Development: Develop and test Spark applications without needing a full Spark installation, making it easier for developers and data scientists to work with distributed data processing.
This architecture enhances usability, scalability, and flexibility, making Spark more accessible to a wider range of users and environments.
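For example, with a plain Spark Connect endpoint, a PySpark client needs only a connection string to attach to a remote cluster. The following is a minimal sketch; the sc:// host and port are placeholders for your own endpoint (15002 is the default Spark Connect port):
from pyspark.sql import SparkSession

# Attach to a remote Spark cluster over Spark Connect (placeholder endpoint)
spark = SparkSession.builder.remote("sc://spark-host:15002").getOrCreate()

# The DataFrame API behaves as usual; execution happens on the remote cluster
spark.range(5).show()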
In this article, you will use JupyterLab locally to interactively prototype a PySpark and Iceberg application in a dedicated Spark Virtual Cluster running in Cloudera Data Engineering (CDE) on AWS.
Create a CDE Files resource and upload the CSV files:
cde resource create \
--name telcoFiles \
--type files
cde resource upload \
--name telcoFiles \
--local-path resources/cell_towers_1.csv \
--local-path resources/cell_towers_2.csv
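Optionally, confirm the files landed in the resource before mounting it. This assumes the cde resource describe subcommand is available in your CLI version:
cde resource describe \
--name telcoFiles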
Start a CDE Session of type Spark Connect. Edit the Session Name parameter so it doesn't collide with other users' sessions.
cde session create \
--name spark-connect-session-res \
--type spark-connect \
--num-executors 2 \
--driver-cores 2 \
--driver-memory "2g" \
--executor-cores 2 \
--executor-memory "2g" \
--mount-1-resource telcoFiles
In the Sessions UI, validate that the Session is in a Running state.
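You can also check the Session status from the terminal; a quick sketch, assuming the cde session describe subcommand in your CLI version:
cde session describe \
--name spark-connect-session-res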
From the terminal, create a virtual environment and install the Spark Connect prerequisites. The cdeconnect.tar.gz and pyspark-3.5.1.tar.gz tarballs are specific to your CDE environment; download them from your CDE Session in the UI before running the commands below:
python -m venv spark_connect_jupyter
source spark_connect_jupyter/bin/activate
pip install numpy==1.26.4
pip install --upgrade cmake
pip install pyarrow==14.0.0
pip install cdeconnect.tar.gz
pip install pyspark-3.5.1.tar.gz
pip install jupyterlab
jupyter lab
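If the notebook server starts but the connection fails later, a quick sanity check from a second terminal (with the spark_connect_jupyter environment activated) confirms that the CDE build of PySpark was installed:
python -c "import pyspark; print(pyspark.__version__)"  # expected: 3.5.1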
You are now ready to connect to the CDE Session from your local JupyterLab instance using Spark Connect.
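In a new notebook, the first cells might look like the following sketch. It assumes the CDESparkConnectSession helper shipped in the cdeconnect package installed above, CDE's default /app/mount mount point for Files resources, and a Virtual Cluster created with Iceberg enabled; the session name must match the one you created earlier:
from cde import CDESparkConnectSession

# Attach to the running CDE Spark Connect Session created earlier
spark = CDESparkConnectSession.builder.sessionName("spark-connect-session-res").get()

# Read the CSV files uploaded to the telcoFiles resource
# (Files resources are mounted under /app/mount by default)
df = spark.read.csv(
    ["/app/mount/cell_towers_1.csv", "/app/mount/cell_towers_2.csv"],
    header=True,
    inferSchema=True,
)
df.show(5)

# Persist the data as an Iceberg table via the DataFrameWriterV2 API
df.writeTo("spark_catalog.default.cell_towers").using("iceberg").createOrReplace()

# Query the Iceberg table back
spark.sql("SELECT COUNT(*) FROM spark_catalog.default.cell_towers").show()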
Now run each cell and observe the outputs.