The Cloudera Data Engineering (CDE) service is a robust and flexible platform for managing data engineering workloads. CDE lets you manage workload configurations, such as the Spark version to use, orchestrate your DE pipelines using Airflow, and provides a rich API to manage and automate pipelines remotely from the command line. At times, however, the standard CDE configurations are not enough, for example, when a workload needs a specific Python version, or OS and Python packages that the standard runtimes do not provide.
Such unique requirements are addressed by building custom runtimes and pushing them as Docker images to an external Docker registry. The CDE service pulls these images from the remote registry and creates the context by setting up the software packages and versions required for running such “special” workloads.
This article provides a step-by-step guide to creating a custom runtime and then pulling the runtime image from a custom Docker registry to run a sample CDE workload.
Runtime Configuration / Requirements: the sample workload in this article needs Python 3.9 together with specific package versions (pandas, impyla, dbt-core with the dbt-impala and dbt-hive adapters, and confluent-kafka) that the standard CDE runtime does not ship.
Before getting started, make sure that you have the following prerequisites set up:
- Access to the Cloudera Docker registry (contact your Cloudera account team; this requires a license)
- A Docker client installed on your local machine, and an account on a Docker registry you can push to (Docker Hub is used in this example)
- The CDE CLI installed and configured against your CDE virtual cluster (a minimal configuration sketch follows)
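If you have not configured the CDE CLI yet, the sketch below creates a minimal ~/.cde/config.yaml; the endpoint is a placeholder, so copy your own Jobs API URL from the CDE virtual cluster details page:
$ cat > ~/.cde/config.yaml <<EOF
user: vishrajagopalan
vcluster-endpoint: https://<your-jobs-api-url>/dex/api/v1
EOF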
The figure above describes the high-level steps of the custom runtime build solution. You must start with a Cloudera-provided runtime image as the first step in building your custom Docker image. To access the Cloudera Docker registry, contact your Cloudera account administrator, as this requires a license. Refer here for details on the process if you have a CDP Private Cloud installation.
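Once your account is entitled, authenticate to the Cloudera registry before pulling the base image; docker login prompts for the user name and password supplied with your Cloudera license:
$ docker login container.repository.cloudera.com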
Create a folder on the local machine where you have the Docker client installed, clone the sample code, and change to the newly created directory:
$ git clone git@github.com:SuperEllipse/CDE-runtime.git
$ cd CDE-runtime
In this step, we fetch the base image from the Cloudera repository and then customize it to include the packages and Python version our workload requires.
Note: This requires access to the Cloudera repository. Contact your Cloudera account team to enable access if you do not have it. Once access to the repository is established, you should be able to build this Dockerfile into a Docker image. Edit the Dockerfile to use your own user name, i.e. substitute vishrajagopalan with <<my-user-name>>.
Filename: Dockerfile
FROM container.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.2.3-7.2.15.8:1.20.0-b15
USER root
# Create a dedicated non-root user and group to run the workload
RUN groupadd -r vishrajagopalan && useradd -r -g vishrajagopalan vishrajagopalan
# Install build tools, libraries, and Python 3.9
RUN yum install ${YUM_OPTIONS} gcc openssl-devel libffi-devel bzip2-devel wget python39 python39-devel && yum clean all && rm -rf /var/cache/yum
# Make Python 3.9 the default python3
RUN update-alternatives --remove-all python
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1
RUN rm /usr/bin/python3
RUN ln -s /usr/bin/python3.9 /usr/bin/python3
# Install pip and the Python packages required by the workload
RUN yum -y install python39-pip
RUN /usr/bin/python3.9 -m pip install --upgrade pip
RUN /usr/bin/python3.9 -m pip install pandas==2.0.0 impyla==0.18.0 dbt-core==1.3.1 dbt-impala==1.3.1 dbt-hive==1.3.1 confluent-kafka[avro,json,protobuf]==1.9.2
ENV PYTHONPATH="${PYTHONPATH}:/usr/local/lib64/python3.9/site-packages:/usr/local/lib/python3.9/site-packages"
#RUN echo $PYTHONPATH
# Sanity checks: verify dbt is installed and the key imports work at build time
RUN dbt --version
RUN /usr/bin/python3.9 -c "import pandas; from impala.dbapi import connect"
USER vishrajagopalan
Execute the following steps in sequence to build your Docker image and push it to the registry (Docker Hub in this example).
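If you are not already authenticated to your target registry, log in first; for Docker Hub, for example, substituting your own user name:
docker login --username vishrajagopalan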
Build the custom runtime image from the Dockerfile in the current directory:
docker build --network=host -t vishrajagopalan/dex-spark-runtime-3.2.3-7.2.15.8:1.20.0-b15-custom . -f Dockerfile
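Optionally, smoke-test the image locally before pushing it. This is a minimal check that the Python packages baked in above import cleanly (if the base image defines an entrypoint, you may need to override it with --entrypoint):
docker run --rm vishrajagopalan/dex-spark-runtime-3.2.3-7.2.15.8:1.20.0-b15-custom /usr/bin/python3.9 -c "import pandas; print(pandas.__version__)"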
Push the image to the registry:
docker push vishrajagopalan/dex-spark-runtime-3.2.3-7.2.15.8:1.20.0-b15-custom
Create a CDE credential so that CDE can pull the image from your registry; the command prompts for the Docker password:
cde credential create --name docker-creds --type docker-basic --docker-server hub.docker.com --docker-username vishrajagopalan
Create a custom runtime image resource pointing to the image you just pushed:
cde resource create --name dex-spark-runtime-custom --image vishrajagopalan/dex-spark-runtime-3.2.3-7.2.15.8:1.20.0-b15-custom --image-engine spark3 --type custom-runtime-image
Run an ad hoc Spark submit against the custom runtime to validate it (the spark-sql.py listing appears at the end of this article):
cde spark submit --user=vishrajagopalan spark-sql.py --runtime-image-resource-name=dex-spark-runtime-custom
Create a file resource and upload the application file to it, adjusting the local path to wherever you cloned the repository:
cde resource create --name sparkfiles-resource
cde resource upload --name sparkfiles-resource --local-path $HOME/Work/sandbox/CDE-runtime/spark-sql.py
Create a CDE job that mounts both the custom runtime resource and the file resource, and run it:
cde job create --name custom_runtime_job_spark3 --type spark --mount-1-resource dex-spark-runtime-custom --mount-2-resource sparkfiles-resource --application-file spark-sql.py --user vishrajagopalan --runtime-image-resource-name dex-spark-runtime-custom
cde job run --name custom_runtime_job_spark3
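The cde job run command prints the ID of the new job run. You can then track the run and inspect the driver output, which includes the Python version printed by spark-sql.py (a sketch, assuming the returned run ID is 1):
cde run describe --id 1
cde run logs --id 1 --type "driver/stdout"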
This article demonstrated the steps needed to build a custom runtime image, push the image to a private Docker registry, and use the custom runtime in a CDE workload. While this example uses Docker Hub, you can also use your organization's private registry for this purpose.
Filename: spark-sql.py
from __future__ import print_function
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
spark = SparkSession\
.builder\
.appName("PythonSQL")\
.getOrCreate()
# A list of Rows. Infer schema from the first row, create a DataFrame and print the schema
rows = [Row(name="John", age=19), Row(name="Smith", age=23), Row(name="Sarah", age=18)]
some_df = spark.createDataFrame(rows)
some_df.printSchema()
# A list of tuples
tuples = [("John", 19), ("Smith", 23), ("Sarah", 18)]
# Schema with two fields - person_name and person_age
schema = StructType([StructField("person_name", StringType(), False),
                     StructField("person_age", IntegerType(), False)])
# Create a DataFrame by applying the schema to the RDD and print the schema
another_df = spark.createDataFrame(tuples, schema)
another_df.printSchema()
for each in another_df.collect():
    print(each[0])
print("Python Version")
print(sys.version)
spark.stop()