
Running Spark apps in Docker Containers on YARN


Hi,

I followed this guide to launch Spark apps in Docker containers on YARN: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/running-spark-applications/content/running_spa...

I need to launch the apps in cluster mode (--deploy-mode cluster) because I work in a multi-tenant environment. This is my submit command:

spark-submit --master yarn --deploy-mode cluster \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<my_docker_image> \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro,/etc/krb5.conf:/etc/krb5.conf:ro \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
--conf spark.yarn.AppMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<my_docker_image> \
--conf spark.yarn.AppMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.AppMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro,/etc/krb5.conf:/etc/krb5.conf:ro \
--conf spark.yarn.AppMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
my_app.py --app_name my_app_name

And my Dockerfile:


# Base image with Java 8 and Python, needed to run the Spark driver and executors
FROM ubuntu:18.04

# Install the JDK and pip, and symlink the JVM to the path used as JAVA_HOME inside the container
RUN apt-get update && apt-get install -y openjdk-8-jdk python-pip
RUN ln -s /usr/lib/jvm/java-1.8.0-openjdk-amd64 /usr/lib/jvm/java

# Upgrade pip, then install PySpark (pinned to the cluster's Spark 2.3.2) and the app's dependencies
RUN pip install -U pip
RUN pip install pyspark==2.3.2 numpy pandas

Although I am setting all of the spark.yarn.AppMasterEnv.* configurations, the driver cannot find any of my dependencies. However, if I install the dependencies locally on the master node and submit the app with "--deploy-mode client", it works (regular users are not allowed to do this; they have to submit their jobs from their JupyterHub environments). So it seems that when I set "--deploy-mode cluster", the driver/AM executes directly on the nodes instead of inside the Docker container.
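To make this concrete, a minimal diagnostic app along these lines (a sketch using only the Python standard library; check_env.py is just a placeholder name, and the numpy probe assumes numpy is absent from the bare hosts) shows where the driver actually runs:

# check_env.py - print where the driver process actually runs
import os
import socket
import sys

print("driver hostname:", socket.gethostname())
print("python executable:", sys.executable)
print("YARN_CONTAINER_RUNTIME_TYPE:", os.environ.get("YARN_CONTAINER_RUNTIME_TYPE"))

# numpy is installed in the Docker image but not on the bare nodes,
# so this import reveals whether the driver picked up the image
try:
    import numpy
    print("numpy available:", numpy.__version__)
except ImportError:
    print("numpy NOT available - driver did not start inside the Docker image")

Submitting this with the same spark-submit command (swapping my_app.py for check_env.py) makes it easy to see whether the AM/driver is inside the container; in my case the import fails in cluster mode, consistent with the driver running outside Docker.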


I don't know if this is the expected behavior or if I'm missing something.


Thank you in advance.
