Created on 09-03-2024 04:45 PM - edited on 09-09-2024 10:46 PM by VidyaSargur
Cloudera Data Engineering (CDE) is a cloud-native service provided by Cloudera. It is designed to simplify and enhance the development, deployment, and management of data engineering workloads at scale. CDE is part of the Cloudera Data Platform (CDP), which is a comprehensive, enterprise-grade platform for managing and analyzing data across hybrid and multi-cloud environments.
Cloudera Data Engineering offers several advantages. With CDE, you can create a "CDE Spark-Submit" using the same syntax as your regular Spark-Submit. Alternatively, you can specify your Spark-Submit as a "CDE Job of type Spark" using a reusable Job Definition, which enhances observability, troubleshooting, and dependency management.
These unique capabilities of CDE are especially useful for Spark Data Engineers who develop and deploy Spark Pipelines at scale. This includes working with different Spark-Submit definitions and dynamic, complex dependencies across multiple clusters.
For example, when packaging a JAR for a Spark Submit, you can include various types of dependencies that your Spark application requires to run properly. These can consist of application code (compiled Scala/Java code), third-party libraries (external dependencies), configuration and resource files (for application configuration or runtime data), and custom JARs (any internal or utility libraries your application needs).
In this article, you will learn how to manage JAR dependencies for Spark jobs in Cloudera Data Engineering across several scenarios.
Scala Spark applications are typically developed and deployed in the following manner:
In this example, you will build a CDE Spark Job with a Scala application that has already been compiled into a JAR. To learn how to complete these steps, please visit this tutorial.
cde resource create --name cde_scala_job_files
cde resource upload --name cde_scala_job_files --local-path jars/cdejobjar_2.12-1.0.jar
cde job create \
--name cde-scala-job \
--type spark \
--mount-1-resource cde_scala_job_files \
--application-file cdejobjar_2.12-1.0.jar \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g
cde job run --name cde-scala-job
You can add further JAR dependencies with the ```--jar``` or ```--jars``` options. In this case, you can add the Spark XML library from the same CDE Files Resource:
cde resource upload --name cde_scala_job_files --local-path jars/spark-xml_2.12-0.16.0.jar
cde job create \
--name cde-scala-job-jar-dependency \
--type spark \
--mount-1-resource cde_scala_job_files \
--application-file cdejobjar_2.12-1.0.jar \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g \
--jar spark-xml_2.12-0.16.0.jar
cde job run --name cde-scala-job-jar-dependency
Notice that you could achieve the same by using two CDE file resources, each containing one of the JARs. You can create as many CDE file resources as needed for each JAR file.
In the following example, you will reference the application code JAR located in the "cde_scala_job_files" CDE Files Resource that you previously created, as well as an additional JAR for the Spark-XML package from a new CDE Files Resource that you will create as "cde_spark_xml_jar".
Note the use of the new "--mount-N-prefix" option below. When you are using more than one CDE Resource with the same "CDE Job Create" command, you need to assign an alias to each Files Resource so that each command option can correctly reference the files.
cde resource create --name cde_spark_xml_jar
cde resource upload --name cde_spark_xml_jar --local-path jars/spark-xml_2.12-0.16.0.jar
cde job create \
--name cde-scala-job-multiple-jar-resources \
--type spark \
--mount-1-prefix scala_app_code \
--mount-1-resource cde_scala_job_files \
--mount-2-prefix spark_xml_jar \
--mount-2-resource cde_spark_xml_jar \
--application-file scala_app_code/cdejobjar_2.12-1.0.jar \
--conf spark.sql.shuffle.partitions=10 \
--executor-cores 2 \
--executor-memory 2g \
--jar spark_xml_jar/spark-xml_2.12-0.16.0.jar
cde job run --name cde-scala-job-multiple-jar-resources
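When several resources are mounted, each `--mount-N-prefix` has to be paired with the `--mount-N-resource` carrying the same index. The following is a minimal sketch, not part of the CDE CLI, of how the command above could be assembled programmatically so the indices always line up. The helper name `build_job_create_args` and its parameters are hypothetical; only flags that already appear in this article are emitted.

```python
import shlex


def build_job_create_args(name, application_file, mounts, jar=None, packages=None):
    """Assemble a `cde job create` argument list with consistently numbered mounts.

    `mounts` is an ordered list of (prefix, resource_name) pairs.
    This is an illustrative helper, not an official CDE API.
    """
    args = ["cde", "job", "create", "--name", name, "--type", "spark"]
    # Number the mounts from 1 so that every prefix is paired with the
    # resource that carries the same index.
    for index, (prefix, resource) in enumerate(mounts, start=1):
        args += [f"--mount-{index}-prefix", prefix,
                 f"--mount-{index}-resource", resource]
    args += ["--application-file", application_file]
    if jar:
        args += ["--jar", jar]
    if packages:
        args += ["--packages", packages]
    return args


args = build_job_create_args(
    name="cde-scala-job-multiple-jar-resources",
    application_file="scala_app_code/cdejobjar_2.12-1.0.jar",
    mounts=[("scala_app_code", "cde_scala_job_files"),
            ("spark_xml_jar", "cde_spark_xml_jar")],
    jar="spark_xml_jar/spark-xml_2.12-0.16.0.jar",
)
# Print the command as a single shell-safe line for review before running it.
print(shlex.join(args))
```

Generating the mount flags in a loop keeps the prefix and resource numbers in step when a third or fourth resource is added later.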
For Maven dependencies, you can use the `--packages` option to automatically download and include dependencies. This is often more convenient than manually managing JAR files. In the following example, the `--packages` option replaces the `--jars` option.
In this example, you will reference the Spark-XML package from Maven so that you can use it to parse the sample "books.xml" file from the CDE Files Resource.
cde resource create --name spark_files --type files
cde resource upload --name spark_files --local-path read_xml.py --local-path books.xml
cde job create --name sparkxml \
--application-file read_xml.py \
--mount-1-resource spark_files \
--type spark \
--packages com.databricks:spark-xml_2.12:0.16.0
cde job run --name sparkxml
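The value passed to `--packages` is a Maven coordinate of the form `groupId:artifactId:version`. As a rough illustration of how that coordinate relates to the JAR file used in the earlier examples, the sketch below parses a coordinate and derives the conventional Maven file name and repository path. The class `MavenCoordinate` is a hypothetical helper written for this article; it does not download anything and is not part of Spark or CDE.

```python
from typing import NamedTuple


class MavenCoordinate(NamedTuple):
    """A groupId:artifactId:version triple, as accepted by --packages."""
    group_id: str
    artifact_id: str
    version: str

    @classmethod
    def parse(cls, text):
        parts = text.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {text!r}")
        return cls(*parts)

    def jar_name(self):
        # Maven names the main artifact <artifactId>-<version>.jar.
        return f"{self.artifact_id}-{self.version}.jar"

    def repository_path(self):
        # In a Maven repository layout, dots in the groupId become directories.
        return "/".join([self.group_id.replace(".", "/"),
                         self.artifact_id, self.version, self.jar_name()])


coordinate = MavenCoordinate.parse("com.databricks:spark-xml_2.12:0.16.0")
print(coordinate.jar_name())         # spark-xml_2.12-0.16.0.jar
print(coordinate.repository_path())  # com/databricks/spark-xml_2.12/0.16.0/spark-xml_2.12-0.16.0.jar
```

The derived file name matches the `spark-xml_2.12-0.16.0.jar` file uploaded to a CDE Files Resource in the JAR-based examples, which is why the two approaches are interchangeable for this library.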
As in the previous example, multiple CDE file resources can be used, here to manage the PySpark application code and the sample XML file separately. Notice that the application code in ```read_xml_multi_resource.py``` is different. At line 67, the ```sample_xml_file``` Files Resource is referenced directly in the application code with its alias ```xml_data```.
cde resource create --name sample_xml_file --type files
cde resource create --name pyspark_script --type files
cde resource upload --name pyspark_script --local-path read_xml_multi_resource.py
cde resource upload --name sample_xml_file --local-path books.xml
cde job create --name sparkxml-multi-deps \
--application-file code/read_xml_multi_resource.py \
--mount-1-prefix code \
--mount-1-resource pyspark_script \
--mount-2-prefix xml_data \
--mount-2-resource sample_xml_file \
--type spark \
--packages com.databricks:spark-xml_2.12:0.16.0
cde job run --name sparkxml-multi-deps
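The script `read_xml_multi_resource.py` is not reproduced in this article. The following is a hypothetical sketch of how a PySpark script might reference a file through its mount alias; it is not the author's code. It assumes that CDE mounts resources under `/app/mount/<prefix>` inside the job containers, that each record in `books.xml` is wrapped in a `<book>` element, and that the `xml` data source is supplied by the spark-xml package.

```python
import posixpath

# Assumption: CDE mounts job resources under this directory inside the
# Spark driver and executor containers.
MOUNT_ROOT = "/app/mount"


def mounted_path(prefix, filename):
    """Return the container path of a file in a resource mounted under `prefix`."""
    return posixpath.join(MOUNT_ROOT, prefix, filename)


def read_books(spark, prefix="xml_data", filename="books.xml"):
    """Load the sample XML file into a DataFrame using the spark-xml reader.

    `spark` is an existing SparkSession. The "book" row tag is an assumption
    about the structure of the sample file.
    """
    return (
        spark.read.format("xml")
        .option("rowTag", "book")
        .load(mounted_path(prefix, filename))
    )
```

In a job script, this would typically be called as `read_books(SparkSession.builder.getOrCreate())` after importing `SparkSession` from `pyspark.sql`. Depending on the cluster's default filesystem, the path may need a `file://` scheme.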
Similar to example 1, you can reference JARs directly uploaded into CDE Files Resources instead of using Maven as in example 2.
The following commands pick up from example 2 but replace the ```packages``` option with the ```jars``` option.
Notice that the ```--jars``` option is used in the ```cde job run``` command rather than in ```cde job create```. The ```--jars``` option can be set either at CDE Job creation or at runtime.
cde resource create --name spark_xml_jar --type files
cde resource upload --name spark_xml_jar --local-path jars/spark-xml_2.12-0.16.0.jar
cde job create --name sparkxml-multi-deps-jar-from-res \
--application-file code/read_xml_multi_resource.py \
--mount-1-prefix code \
--mount-1-resource pyspark_script \
--mount-2-prefix xml_data \
--mount-2-resource sample_xml_file \
--mount-3-prefix deps \
--mount-3-resource spark_xml_jar \
--type spark
cde job run --name sparkxml-multi-deps-jar-from-res \
--jar deps/spark-xml_2.12-0.16.0.jar
In this article, the CDE CLI was used to simplify Spark JAR management with Cloudera Data Engineering.