Objective

Cloudera Data Engineering (CDE) is a cloud-native service provided by Cloudera. It is designed to simplify and enhance the development, deployment, and management of data engineering workloads at scale. CDE is part of the Cloudera Data Platform (CDP), which is a comprehensive, enterprise-grade platform for managing and analyzing data across hybrid and multi-cloud environments.

Cloudera Data Engineering offers several advantages. With CDE, you can create a "CDE Spark-Submit" using the same syntax as your regular Spark-Submit. Alternatively, you can specify your Spark-Submit as a "CDE Job of type Spark" using a reusable Job Definition, which enhances observability, troubleshooting, and dependency management.

These unique capabilities of CDE are especially useful for Spark Data Engineers who develop and deploy Spark Pipelines at scale. This includes working with different Spark-Submit definitions and dynamic, complex dependencies across multiple clusters.

For example, when packaging a JAR for a Spark Submit, you can include various types of dependencies that your Spark application requires to run properly. These can consist of application code (compiled Scala/Java code), third-party libraries (external dependencies), configuration and resource files (for application configuration or runtime data), and custom JARs (any internal or utility libraries your application needs).

In this article, you will learn how to manage JAR dependencies effectively in Cloudera Data Engineering across several scenarios.

Example 1: CDE Job with Scala Application Code in Spark Jar

Scala Spark applications are typically developed and deployed in the following manner:

  1. Set Up Project in IDE: Use SBT to set up a Scala project in your IDE.
  2. Write Code: Write your Scala application.
  3. Compile & Package: Use sbt package to compile and package your code into a JAR.
  4. Submit to Spark: Use spark-submit to run your JAR on a Spark cluster.

In this example, you will build a CDE Spark Job with a Scala application that has already been compiled into a JAR. To learn how to complete these steps, please visit this tutorial.

 

cde resource create --name cde_scala_job_files

cde resource upload --name cde_scala_job_files --local-path jars/cdejobjar_2.12-1.0.jar

cde job create \
--name cde-scala-job \
--type spark \
--mount-1-resource cde_scala_job_files \
--application-file cdejobjar_2.12-1.0.jar \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g

cde job run --name cde-scala-job

 

You can add further JAR dependencies with the ```--jar``` or ```--jars``` options. In this case, you can add the Spark XML library from the same CDE Files Resource:

 

cde resource upload --name cde_scala_job_files --local-path jars/spark-xml_2.12-0.16.0.jar

cde job create \
--name cde-scala-job-jar-dependency \
--type spark \
--mount-1-resource cde_scala_job_files \
--application-file cdejobjar_2.12-1.0.jar \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g \
--jar spark-xml_2.12-0.16.0.jar

cde job run --name cde-scala-job-jar-dependency

 

Notice that you could achieve the same by using two CDE file resources, each containing one of the JARs. You can create as many CDE file resources as needed for each JAR file.

In the following example, you will reference the application code JAR located in the "cde_scala_job_files" CDE Files Resource that you previously created, as well as an additional JAR for the Spark-XML package from a new CDE Files Resource named "cde_spark_xml_jar".

Note the use of the new "--mount-N-prefix" option below. When you are using more than one CDE Resource with the same "CDE Job Create" command, you need to assign an alias to each Files Resource so that each command option can correctly reference the files.

 

cde resource create --name cde_spark_xml_jar

cde resource upload --name cde_spark_xml_jar --local-path jars/spark-xml_2.12-0.16.0.jar

cde job create \
--name cde-scala-job-multiple-jar-resources \
--type spark \
--mount-1-prefix scala_app_code \
--mount-1-resource cde_scala_job_files \
--mount-2-prefix spark_xml_jar \
--mount-2-resource cde_spark_xml_jar \
--application-file scala_app_code/cdejobjar_2.12-1.0.jar \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g \
--jar spark_xml_jar/spark-xml_2.12-0.16.0.jar

cde job run --name cde-scala-job-multiple-jar-resources
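
The pairing of mount numbers, prefixes, and resources used above can be sketched as a small shell helper. This is only an illustration and not part of the CDE CLI: the function name ```mount_flags``` and its ```prefix=resource``` argument format are hypothetical. It prints the flag pairs without calling ```cde```, so the numbering stays consistent as more resources are added.

```shell
# Emit numbered --mount-N-prefix / --mount-N-resource flag pairs.
# Each argument has the form prefix=resource (hypothetical input format).
mount_flags() {
  n=1
  flags=""
  for pair in "$@"; do
    prefix=${pair%%=*}     # alias used in file paths, e.g. scala_app_code
    resource=${pair#*=}    # name of the CDE Files Resource
    flags="$flags --mount-${n}-prefix $prefix --mount-${n}-resource $resource"
    n=$((n + 1))
  done
  printf '%s\n' "${flags# }"   # drop the leading space
}

# Print the flags for the two resources used in the job above.
mount_flags scala_app_code=cde_scala_job_files spark_xml_jar=cde_spark_xml_jar
```

The printed flags could be spliced into a ```cde job create``` command through an unquoted command substitution, which works only as long as prefixes and resource names contain no spaces.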

 

Example 2: CDE Job with PySpark Application Code and Jar Dependency from Maven

For Maven dependencies, you can use the `--packages` option to automatically download and include dependencies. This is often more convenient than manually managing JAR files. In the following example, the `--packages` option replaces the `--jars` option.

In this example, you will reference the Spark-XML package from Maven so that you can use it to parse the sample "books.xml" file from the CDE Files Resource.

 

cde resource create --name spark_files --type files

cde resource upload --name spark_files --local-path read_xml.py --local-path books.xml

cde job create --name sparkxml \
--application-file read_xml.py \
--mount-1-resource spark_files \
--type spark \
--packages com.databricks:spark-xml_2.12:0.16.0

cde job run --name sparkxml
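
The value passed to ```--packages``` is a Maven coordinate of the form groupId:artifactId:version. As a minimal sketch, the snippet below splits the coordinate used above and derives the conventional JAR file name, which matches the ```spark-xml_2.12-0.16.0.jar``` file uploaded in example 1. The variable names are illustrative, and the snippet assumes a plain three-part coordinate without a classifier or packaging field.

```shell
# Split a Maven coordinate (groupId:artifactId:version) into its parts
# using POSIX parameter expansion, then derive the conventional JAR name.
coord="com.databricks:spark-xml_2.12:0.16.0"

group=${coord%%:*}        # text before the first colon
rest=${coord#*:}          # text after the first colon
artifact=${rest%%:*}      # text before the second colon
version=${rest#*:}        # text after the second colon

# Maven names the artifact file <artifactId>-<version>.jar by convention.
jar_name="${artifact}-${version}.jar"

echo "group:    $group"
echo "artifact: $artifact"
echo "version:  $version"
echo "jar file: $jar_name"
```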

As in the previous example, multiple CDE Files Resources can be used to manage the PySpark application code and the sample XML file separately. Notice that the application code in ```read_xml_multi_resource.py``` is different. At line 67, the ```sample_xml_file``` Files Resource is referenced directly in the application code through its alias ```xml_data```.

 

 

cde resource create --name sample_xml_file --type files
cde resource create --name pyspark_script --type files

cde resource upload --name pyspark_script --local-path read_xml_multi_resource.py
cde resource upload --name sample_xml_file --local-path books.xml

cde job create --name sparkxml-multi-deps \
--application-file code/read_xml_multi_resource.py \
--mount-1-prefix code \
--mount-1-resource pyspark_script \
--mount-2-prefix xml_data \
--mount-2-resource sample_xml_file \
--type spark \
--packages com.databricks:spark-xml_2.12:0.16.0

cde job run --name sparkxml-multi-deps
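
To make the role of the alias concrete, the sketch below builds the path that application code would use for a file in a mounted resource. It assumes that CDE mounts Files Resources under ```/app/mount``` inside the job container and that the mount prefix becomes a subdirectory of that location; verify both against the documentation for your CDE version. The helper name ```mounted_path``` is hypothetical.

```shell
# Assumption: CDE mounts Files Resources under this directory in the job container.
MOUNT_ROOT="/app/mount"

# Build the path a job would use for a file in a mounted resource.
# $1 = mount prefix (alias; may be empty), $2 = file name
mounted_path() {
  if [ -n "$1" ]; then
    printf '%s/%s/%s\n' "$MOUNT_ROOT" "$1" "$2"
  else
    printf '%s/%s\n' "$MOUNT_ROOT" "$2"
  fi
}

mounted_path xml_data books.xml   # alias set with --mount-2-prefix
mounted_path "" books.xml         # resource mounted without a prefix
```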

 

Example 3: CDE Job with PySpark Application Code and Jar Dependency from CDE Files Resource

Similar to example 1, you can reference JARs directly uploaded into CDE Files Resources instead of using Maven as in example 2.

The following commands pick up from example 2 but replace the ```--packages``` option with the ```--jar``` option.

Notice that the JAR is passed with the ```--jar``` option in the ```cde job run``` command rather than in ```cde job create```. JAR options such as ```--jar``` and ```--jars``` can be set either at CDE Job creation or at run time.

 

cde resource create --name spark_xml_jar --type files

cde resource upload --name spark_xml_jar --local-path jars/spark-xml_2.12-0.16.0.jar

cde job create --name sparkxml-multi-deps-jar-from-res \
--application-file code/read_xml_multi_resource.py \
--mount-1-prefix code \
--mount-1-resource pyspark_script \
--mount-2-prefix xml_data \
--mount-2-resource sample_xml_file \
--mount-3-prefix deps \
--mount-3-resource spark_xml_jar \
--type spark

cde job run --name sparkxml-multi-deps-jar-from-res \
--jar deps/spark-xml_2.12-0.16.0.jar

 

Summary

In this article, the CDE CLI was used to simplify Spark JAR management with Cloudera Data Engineering.

  • You can use the CDE CLI to create CDE Job Definitions with Spark JAR dependencies and to create CDE Files Resources that store one or more JARs for those jobs to reference.
  • Cloudera Data Engineering offers significant improvements in Spark Dependency Management compared to traditional Spark-Submits outside of CDE.
  • The Job Runs page in the CDE UI can be used to monitor JAR dependencies applied to each job execution. Cloudera Data Engineering presents substantial advancements in Spark Observability and Troubleshooting compared to traditional Spark-Submits outside of CDE.

