If we use a Docker container to deliver our Spark ML code to a multi-node Hadoop cluster, does this impact the parallel execution of the Spark jobs? How will the driver communicate with the executors and the ResourceManager? Does everything remain the same as if we weren't using Docker?
Overall, what is the impact, if any, of delivering Spark code within a Docker container?
Without knowing how you are executing the whole process, it sounds like you ran spark-submit from a Docker container. In that case, only the spark-submit client process itself runs inside Docker. If you have mounted the HADOOP_CONF_DIR directory into the container, this is no different from running it outside the container.
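As a minimal sketch of that setup (the image name, jar path, and class name below are placeholders, not anything from your environment):

```shell
# Run spark-submit from inside a container, mounting the cluster's
# Hadoop config so the client can locate YARN's ResourceManager and HDFS.
docker run --rm \
  -v /etc/hadoop/conf:/etc/hadoop/conf:ro \
  -e HADOOP_CONF_DIR=/etc/hadoop/conf \
  my-spark-image:latest \
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.example.MySparkMLJob \
    /app/my-spark-ml-job.jar
```

With the config mounted, the containerized client talks to YARN exactly as a client on the host would.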
Additionally, if you submitted in cluster mode to YARN, then the Spark application master, driver, and executors are no different from regular YARN processes; whereas if you submitted in client mode, the Spark driver remains inside the Docker container until the Spark application ends.
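The two modes differ only in where the driver runs; the flags below are standard spark-submit options and the jar path is a placeholder:

```shell
# Cluster mode: the driver runs in a YARN container on the cluster,
# so the Docker container can exit once submission completes.
spark-submit --master yarn --deploy-mode cluster /app/job.jar

# Client mode: the driver runs inside this Docker container for the
# lifetime of the application, so the container must stay up and be
# network-reachable by the executors (e.g. via host networking or by
# setting spark.driver.host to an address the cluster can route to).
spark-submit --master yarn --deploy-mode client /app/job.jar
```

In short: parallelism on the cluster is unaffected either way; the only operational difference is that client mode ties the driver's lifetime and network reachability to the container.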