In this brief example, you will learn how to use the CDE CLI to create CDE Spark jobs with PySpark and Scala application code located in an S3 bucket.
To reproduce these examples you will need a CDE Virtual Cluster, the CDE CLI installed and configured against that cluster, and an S3 bucket, accessible from the cluster, where your application code and dependencies are stored.
CDE provides a command line interface (CLI) client. You can use the CLI to create and update jobs, view job details, manage job resources, run jobs, etc.
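For instance, once the CLI is configured against your Virtual Cluster, a few everyday commands look like the sketch below (the job name is a placeholder; flags can vary slightly between CLI versions):
# List the jobs defined in the Virtual Cluster
cde job list

# Show the full definition of a single job
cde job describe --name my-cde-job

# List recent job runs
cde run list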
Apache Spark's spark-submit allows you to run a Spark job with application code located in an S3 bucket. The CDE CLI provides the same functionality.
For example, in PySpark:
spark-submit \
--master yarn \
--deploy-mode cluster \
--py-files s3://your-bucket/path/to/dependency_one.zip,s3://your-bucket/path/to/dependency_two.py \
--jars s3://your-bucket/path/to/dependency_one.jar,s3://your-bucket/path/to/dependency_two.jar \
s3://your-bucket/path/to/pyspark_app.py \
--arg1 value_one --arg2 value_two
Or with a Jar compiled from Scala application code:
spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.example.YourSparkApp \
--jars s3://your-bucket/path/to/dependency_one.jar,s3://your-bucket/path/to/dependency_two.jar \
s3://your-bucket/path/to/spark_app.jar \
--arg1 value_one --arg2 value_two
You can accomplish the same with the CDE CLI, either with a CDE spark submit or by creating a CDE job. In both cases, the spark.hadoop.fs.s3.impl and spark.hadoop.fs.s3a.impl settings point Spark at the S3A connector so that it can read the application code and dependencies directly from the bucket.
CDE Spark Submit with PySpark application:
cde spark submit \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
s3://your-bucket/path/to/pyspark_app.py
CDE Job with PySpark application:
cde job create \
--name my-cde-job-pyspark \
--type spark \
--application-file s3://your-bucket/path/to/pyspark_app.py \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--py-files s3://your-bucket/path/to/dependency_one.zip,s3://your-bucket/path/to/dependency_two.py \
--jars s3://your-bucket/path/to/dependency_one.jar,s3://your-bucket/path/to/dependency_two.jar \
--arg value_one
CDE Spark Submit with Scala application:
cde spark submit \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
s3://your-bucket/path/to/spark_app.jar
CDE Job with Scala application:
cde job create \
--name my-cde-job-scala \
--type spark \
--application-file s3://your-bucket/path/to/spark_app.jar \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--jars s3://your-bucket/path/to/dependency_one.jar,s3://your-bucket/path/to/dependency_two.jar \
--arg value_one
For example, in the case of a sample PySpark application:
CDE Spark Submit:
cde spark submit \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
s3://pdf-3425-buk-c59557bd/data-eng-artifacts/cde_spark_jobs/simple-pyspark-sql.py
CDE Job:
cde job create \
--name my-cde-job-from-s3-pyspark \
--type spark \
--application-file s3://pdf-3425-buk-c59557bd/data-eng-artifacts/cde_spark_jobs/simple-pyspark-sql.py \
--conf spark.sql.shuffle.partitions=10 \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--executor-cores 2 \
--executor-memory 2g
cde job run \
--name my-cde-job-from-s3-pyspark
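After triggering the run, you can also monitor it from the CLI. A minimal sketch, assuming the run ID returned by the previous command was 42 (the ID and the --type value are illustrative and may differ in your environment):
# Check the status, Spark settings, and timing of a specific run
cde run describe --id 42

# Retrieve the driver logs for the same run
cde run logs --id 42 --type "driver/stdout"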
Or with a Scala Jar.
CDE Spark Submit:
cde spark submit \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
s3://pdf-3425-buk-c59557bd/data-eng-artifacts/cde_spark_jobs/cde-scala-example_2.12-0.1.jar
CDE Job:
cde job create \
--name my-cde-job-from-s3-scalajar \
--type spark \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--application-file s3://pdf-3425-buk-c59557bd/data-eng-artifacts/cde_spark_jobs/cde-scala-example_2.12-0.1.jar \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g
cde job run \
--name my-cde-job-from-s3-scalajar
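If you later need to change the job's configuration, you can update the existing definition instead of recreating it. A short sketch, assuming cde job update accepts the same tuning flags as cde job create (check your CLI version):
# Increase executor memory on the existing job definition
cde job update \
--name my-cde-job-from-s3-scalajar \
--executor-memory 4g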
As an alternative to hosting your application code and file dependencies in S3, you can leverage CDE Files Resources.
Files Resources are arbitrary collections of files that a job can reference, where application code and any necessary configuration files or supporting libraries can be stored. Files can be uploaded to and removed from a resource as needed.
CDE Files Resources offer a few key advantages over external object storage, most notably that jobs reference the uploaded files directly, so no additional filesystem configuration or credentials are needed at submit time.
You can create a CDE Files Resource with the CLI:
cde resource create \
--name my-files-resource \
--type files
You can upload files to the Resource:
cde resource upload \
--name my-files-resource \
--local-path simple-pyspark-sql.py
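If you want to confirm what the resource now contains, you can inspect it (the output format depends on your CLI version):
# Show the resource's metadata and the files it contains
cde resource describe --name my-files-resource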
Next, you can mount the Files Resource when creating the CDE job definition:
cde job create \
--type spark \
--name my-job-with-resource \
--mount-1-resource my-files-resource \
--application-file simple-pyspark-sql.py
And finally, run the job with:
cde job run \
--name my-job-with-resource \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g
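Because the job definition points at the file name inside the resource rather than an S3 path, updating the application is just a matter of re-uploading it. The assumption here is that uploading a file with the same name replaces the previous copy:
# Upload a new version of the script; the existing file with the same name is replaced
cde resource upload \
--name my-files-resource \
--local-path simple-pyspark-sql.py

# The next run picks up the updated code without changing the job definition
cde job run \
--name my-job-with-resource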
For more in-depth information on using CDE Resources, please visit the following publications:
In this article, we reviewed some advanced use cases for the CDE CLI: running Spark jobs whose application code is hosted in S3, and running jobs from CDE Files Resources. If you are using the CDE CLI, you might also find the following articles and demos interesting: