I'm having a hard time configuring Spark properly in my Docker image and publishing it on K8s.

The context: I have a Cloudera cluster that has Spark installed with a YARN configuration, so I went into Cloudera Manager to export the yarn-conf. Normally we use CDSW to launch engine:8 images in order to use Spark on YARN, but I want to use Spark on YARN outside of CDSW, in another cluster (K8s), where I can run websites or APIs that use Spark on YARN on CDH. The yarn-conf contains all the configuration for Kerberos, IPs and ports.

The Cloudera cluster is running Spark version 2.4.0-cdh6.3.4. I can't find proper documentation on how to install spark/pyspark with the Cloudera configuration and jars. (By the way, I don't have a user/password to download the jars, since I'm not the one who holds the subscription; I'm only a user on the cluster.)

Where should I get and store PYSPARK_ARCHIVES_PATH or SPARK_DIST_CLASSPATH?

What I tried in my Dockerfile (Debian 10), sketched at the end of this post:

- install Java JDK 8 from http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz (or OpenJDK 8)
- set up keytool with the corporate cert files
- install Spark 2.4 from https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
- install pyspark==2.4.0
- set SPARK_CONF_DIR and HADOOP_CONF_DIR to /etc/yarn-conf (where I copy the yarn-conf with core-site.xml and yarn-site.xml)

I only need to use Spark in client mode. I will never use it in local mode, only via YARN on the Cloudera cluster. But should I install all the jars from https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11/2.4.0-cdh6.3.4? If yes, how should I do that? Also, should I open ports when I deploy on K8s? Am I overthinking this? I can't find any documentation or tutorial that covers my use case.
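For reference, here is a stripped-down sketch of what my Dockerfile currently looks like (the corporate cert/keytool steps are omitted, and the JDK tarball name just comes from the link above; it is assumed to be in the build context):

```dockerfile
FROM debian:10

# Where everything ends up inside the image
ENV JAVA_HOME=/opt/jdk1.8.0_131
ENV SPARK_HOME=/opt/spark
ENV HADOOP_CONF_DIR=/etc/yarn-conf
ENV SPARK_CONF_DIR=/etc/yarn-conf
ENV PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH

RUN apt-get update && \
    apt-get install -y --no-install-recommends curl python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# JDK 8 tarball from the build context (ADD auto-extracts it into /opt)
ADD jdk-8u131-linux-x64.tar.gz /opt/

# Spark 2.4.x built against Hadoop 2.7, from the Apache archive
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz \
      | tar -xz -C /opt && \
    ln -s /opt/spark-2.4.8-bin-hadoop2.7 /opt/spark

RUN pip3 install pyspark==2.4.0

# yarn-conf exported from Cloudera Manager (core-site.xml, yarn-site.xml, ...)
COPY yarn-conf/ /etc/yarn-conf/
```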
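And this is roughly how I intend to create the session from the API/website container, in client mode on YARN. The principal, keytab path, hostname and ports below are placeholders, and the driver.host/port settings are exactly what my question about opening ports on K8s is about:

```python
import os
from pyspark.sql import SparkSession

# Point Spark/PySpark at the local install and the Cloudera client config
os.environ["SPARK_HOME"] = "/opt/spark"
os.environ["HADOOP_CONF_DIR"] = "/etc/yarn-conf"
os.environ["SPARK_CONF_DIR"] = "/etc/yarn-conf"

spark = (
    SparkSession.builder
    .master("yarn")  # deploy mode defaults to client, which is all I need
    .appName("api-on-k8s")
    # Placeholder Kerberos identity, mounted into the pod
    .config("spark.yarn.principal", "svc_api@EXAMPLE.REALM")
    .config("spark.yarn.keytab", "/etc/security/keytabs/svc_api.keytab")
    # In client mode the YARN executors must be able to reach the driver
    # running in the K8s pod; placeholder hostname and ports
    .config("spark.driver.host", "my-pod-or-service-hostname")
    .config("spark.driver.port", "40000")
    .config("spark.blockManager.port", "40001")
    .getOrCreate()
)

spark.range(10).count()  # quick smoke test against the cluster
```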