I'm having a hard time properly configuring Spark in my Docker image and publishing it on K8s.
The context is: I have a Cloudera cluster with Spark installed under a YARN configuration, so I went into Cloudera Manager to download the yarn-config.
Normally we use CDSW to launch engine:8 images that use Spark with YARN. But I want to use Spark on YARN outside CDSW, in another cluster (K8s), where I can build websites or APIs that will use Spark with YARN on CDH.
Inside the yarn-config I have all the settings for Kerberos and the IPs and ports to use.
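For reference, here is roughly how I planned to wire that config into the image. The /etc/hadoop/conf path is just my own layout choice, and HADOOP_CONF_DIR / YARN_CONF_DIR are, as far as I understand, the standard variables the Spark client reads to find it:

```dockerfile
# Copy the yarn-config downloaded from Cloudera Manager into the image.
# /etc/hadoop/conf is my own choice of path, not anything Cloudera mandates.
COPY yarn-conf/ /etc/hadoop/conf/

# The Spark/Hadoop clients read these to locate core-site.xml, yarn-site.xml,
# and the Kerberos-related settings.
ENV HADOOP_CONF_DIR=/etc/hadoop/conf \
    YARN_CONF_DIR=/etc/hadoop/conf
```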
The Cloudera cluster is using Spark version 2.4.0-cdh6.3.4.
I can't find proper documentation on how to install Spark/PySpark with the Cloudera configuration and jars. (By the way, I don't have a user/password to download the jars, since I'm not the one who holds the subscription; I'm only a user on the cluster.)
Where should I get the archives and jars that PYSPARK_ARCHIVES_PATH and SPARK_DIST_CLASSPATH are supposed to point to, and where should I store them in the image?
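To make the question concrete, this is the shape of what I think those settings are supposed to look like; the values below are pure guesses on my part:

```dockerfile
# Guesswork: SPARK_DIST_CLASSPATH is normally the output of `hadoop classpath`
# on a node that has the Hadoop client installed, so what do I put here inside
# a bare Debian image?
ENV SPARK_DIST_CLASSPATH="/opt/cloudera/jars/*"

# PYSPARK_ARCHIVES_PATH should apparently point at the pyspark/py4j zips that
# ship with a Spark distribution (py4j 0.10.7 being the version bundled with
# Spark 2.4, if I'm not mistaken). But which distribution do I take them from?
ENV PYSPARK_ARCHIVES_PATH="local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip"
```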
What I tried in my Dockerfile (Debian 10) boils down to something like this (simplified sketch; package choices are approximate):
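```dockerfile
FROM debian:10

# Java runtime for the Spark driver (Spark 2.4 targets Java 8, so the JRE
# package may need adjusting), Python for PySpark, and a Kerberos client for
# the kerberized cluster.
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
        default-jre-headless python3 python3-pip \
        krb5-user curl ca-certificates \
 && rm -rf /var/lib/apt/lists/*
```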
I only need to use Spark in client mode. I will never use it in local mode, only via YARN on the Cloudera cluster. But should I install all the jars from https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11/2.4.0-cdh6.3.4? If yes, how should I do that?
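One alternative I'm considering, instead of hunting down the CDH jars one by one, is to install stock Apache Spark 2.4.0 in the image and rely on the cluster-side CDH jars at runtime; I have no idea whether the cdh6.3.4 patch level matters for a client-mode driver:

```dockerfile
# Stock Apache Spark 2.4.0 rather than the CDH build (that is the open
# question). The tarball is the official one from the Apache archive.
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz \
    | tar -xz -C /opt \
 && ln -s /opt/spark-2.4.0-bin-hadoop2.7 /opt/spark

ENV SPARK_HOME=/opt/spark \
    PATH=$PATH:/opt/spark/bin
```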
Also, should I open ports when I deploy on K8s? Am I overthinking this? I can't find any documentation or tutorial that covers my use case.
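In case it helps frame the port question: my understanding is that in client mode the YARN executors connect back to the driver, so I imagined pinning the driver ports to fixed values and exposing them through a K8s Service. The port numbers below are arbitrary:

```dockerfile
# Arbitrary fixed ports so a K8s Service could route to them; Spark would then
# be launched with matching settings such as spark.driver.port=40000 and
# spark.blockManager.port=40001. (EXPOSE itself is documentation only.)
EXPOSE 40000 40001
```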