Support Questions
Find answers, ask questions, and share your expertise

SPARK and Docker in a multi node cluster


New Contributor

Hi All,

If we use a Docker container to deliver our Spark ML code in a multi-node Hadoop cluster, does this impact the parallel execution of the Spark jobs? How will the driver then communicate with the executors and the ResourceManager? Does everything remain the same as when we don't use Docker?

Overall, what impact, if any, does delivering Spark code inside a Docker container have?



Re: SPARK and Docker in a multi node cluster

Super Collaborator

Without knowing exactly how you are executing the whole process, it sounds like you ran spark-submit from a Docker container. In that case, only the initial spark-submit process happens inside Docker. If you have mounted the HADOOP_CONF directory into the container, this is no different from running spark-submit outside the container.

Additionally, if you submitted in cluster mode to YARN, then the Spark application master / driver and executors are ordinary YARN processes running on the cluster. In client mode, by contrast, the Spark driver remains inside the Docker container until the Spark application ends, so the container must stay up and be reachable from the cluster for the lifetime of the job.
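For illustration, here is a minimal sketch of that setup. The image name, jar path, and class name are placeholders (not from this thread); the key points are mounting the cluster's Hadoop client configuration into the container and choosing the deploy mode deliberately:

```shell
# Hypothetical sketch: running spark-submit from inside a Docker container.
# "my-spark-image", the jar path, and the class name are assumed placeholders.

# Mount the cluster's Hadoop client config read-only so spark-submit can find
# the YARN ResourceManager and HDFS, and use host networking so the container
# can reach (and, in client mode, be reached by) the cluster nodes.
docker run --rm \
  --network host \
  -v /etc/hadoop/conf:/etc/hadoop/conf:ro \
  -e HADOOP_CONF_DIR=/etc/hadoop/conf \
  my-spark-image:latest \
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.example.MySparkMLJob \
    /opt/app/my-spark-ml-job.jar
```

With --deploy-mode cluster, the driver runs in a YARN container on the cluster, so the Docker container can exit as soon as submission completes; parallel execution of the job is unaffected by Docker. With --deploy-mode client, the driver stays inside the container, so it must remain running and network-addressable by the executors.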
