Cloudera Data Engineering (CDE) is a cloud native service for Cloudera Data Platform that allows you to submit batch jobs to auto-scaling virtual clusters. CDE enables you to spend more time on your applications, and less time on infrastructure.
Wheels allow for faster installations and more stability in the package distribution process. In the context of PySpark, Wheels allow you to make dependent Python modules available to the executors without having to pip install dependencies on every node, and to distribute your application source code as a package.
In this tutorial you will create a CDE Spark Job using a Wheel file via the CDE CLI.
In order to execute this hands-on lab you need:
* A Spark 3 and Iceberg-enabled CDE Virtual Cluster (Azure, AWS and Private Cloud ok).
* The CDE CLI installed on your local machine. If you need to install it for the first time please follow these steps.
* Familiarity with Python, PySpark and the CDE CLI is highly recommended.
* No script code changes are required.
Clone this GitHub repository to your local machine or the VM where you will be running the script.
mkdir ~/Documents/cde_wheel_jobs
cd ~/Documents/cde_wheel_jobs
git clone https://github.com/pdefusco/CDE_Wheel_Jobs.git
Alternatively, if you don't have git installed on your machine, create a folder on your local computer; navigate to this URL and manually download the files.
The Spark Job code can be found in the ```mywheel/__main__.py``` file and does not require modifications. For demo purposes we have chosen a simple Spark SQL job.
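The exact code in the repository may differ slightly, but a minimal Spark SQL job of this kind generally looks like the following sketch (the sample data and query here are purely illustrative):

```
# Illustrative sketch only; the actual mywheel/__main__.py may differ.
# A minimal PySpark SQL job creates a SparkSession, registers some data
# as a temporary view, and runs a SQL query against it.
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("wheel-demo").getOrCreate()

    # Hypothetical sample data; a real job would typically read existing tables.
    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
    df.createOrReplaceTempView("people")

    # Run a simple Spark SQL query and print the result to the driver log.
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()


if __name__ == "__main__":
    main()
```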
The Wheel has already been created for you; it is included in the repository, so it will be available in the ```dist``` directory on your local machine as soon as you clone the project.
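If you want to rebuild the Wheel yourself, the packaging files in the repository may differ from this sketch, but a minimal ```setup.py``` along these lines, built with ```python -m build``` (or ```python setup.py bdist_wheel``` with the ```wheel``` package installed), would produce ```mywheel-0.0.1-py3-none-any.whl``` under ```dist```:

```
# setup.py -- minimal packaging sketch; assumes a mywheel/ package directory
# containing an __init__.py and the __main__.py application code.
from setuptools import setup, find_packages

setup(
    name="mywheel",
    version="0.0.1",
    packages=find_packages(),
)
```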
A CDE Spark Submit is the fastest way to prototype a Spark Job. In this example we will run a CDE Spark Submit with the Wheel file.
Once you have the CDE CLI installed, you can launch a CDE Job from your local machine with the ```cde spark submit``` command. Copy the following command and execute it in your terminal:
cde spark submit --py-files dist/mywheel-0.0.1-py3-none-any.whl mywheel/__main__.py
In the terminal, validate that the Spark Job has launched successfully and note the Job Run ID.
Next, navigate to the CDE Job Runs UI and validate job execution:
Open the Job Configuration tab and notice that the Wheel has been uploaded in a File Resource for you.
However, notice that the Job Configuration tab does not provide a means to edit or reschedule the job definition. In other words, the entries in the Configuration tab are final. In order to be able to change the definition, we will need to create a CDE Spark Job.
Like a CDE Spark Submit, a CDE Spark Job is application code that executes a Spark workload in a CDE Virtual Cluster. However, a CDE Job allows you to easily define, edit, and reuse configurations and resources in future runs. Jobs can be run on demand or on a schedule. An individual job execution is called a job run.
In this example we will create a CDE Resource of type File and upload the Spark Application code and the Wheel dependency. Then, we will run the Job.
Execute the following CDE CLI commands in your local terminal.
Create the File Resource:
cde resource create --name mywheels
Upload Application Code and Wheel to the File Resource:
cde resource upload --name mywheels --local-path dist/mywheel-0.0.1-py3-none-any.whl
cde resource upload --name mywheels --local-path mywheel/__main__.py
Navigate to the CDE Resource tab and validate that the Resource and the corresponding files are now available.
Create the CDE Spark Job definition:

cde job create --name cde_wheel_job --type spark --py-files mywheel-0.0.1-py3-none-any.whl --application-file __main__.py --mount-1-resource mywheels

Navigate to the CDE Jobs UI and notice that a new CDE Spark Job has been created. The job has not run yet, so only the Configuration tab is populated with the Spark Job definition.

Finally, run the job:

cde job run --name cde_wheel_job

The Job Runs page will now include a new entry reflecting the job execution.
Notice that the CDE Job definition can now be edited. This allows you to make changes to files and dependencies, create or change the job execution schedule, and more. For example, the CDE Job can now be run again on demand without redefining it.
CDE is the Cloudera Data Engineering Service, a containerized managed service for Spark and Airflow.
If you are exploring CDE you may find the following resources relevant:
* For more information on the Cloudera Data Platform and its form factors, please visit this site.
* For more information on migrating Spark jobs to CDE, please reference this guide.
If you have any questions about CDE or would like to see a demo, please reach out to your Cloudera Account Team or send a message through this portal and we will be in contact with you soon.