Install Spark client in Dockerfile to access YARN cluster on Cloudera

I'm having a hard time properly configuring Spark in my Docker image and publishing it on K8s.

The context: I have a Cloudera cluster with Spark installed and configured to run on YARN, so I went into Cloudera Manager to download the yarn-config.

Normally we use CDSW to launch engine:8 images to use Spark with YARN, but I want to use Spark on YARN outside CDSW, from another cluster (K8s) where I can build websites or APIs that use Spark with YARN on CDH.

Inside the yarn-config I have all the configuration I need: the Kerberos settings and the IPs and ports to use.
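
For what it's worth, my mental model of how the client side should work once everything is in place is roughly this (the keytab path and principal are placeholders, and using kinit with a keytab is my assumption for handling Kerberos non-interactively):

    # Inside the running container: authenticate against the cluster KDC first,
    # then spark-submit picks up the ResourceManager address from the
    # yarn-config via HADOOP_CONF_DIR
    kinit -kt /etc/security/keytabs/me.keytab me@EXAMPLE.REALM
    spark-submit --master yarn --deploy-mode client my_job.py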

The Cloudera cluster is running Spark version 2.4.0-cdh6.3.4.

I can find no proper documentation on how to install Spark/PySpark with the Cloudera configuration and jars. (By the way, I do not have a username/password to download the jars, since I'm not the subscription holder; I'm only a user on the cluster.)

Where should I get the archives and jars that PYSPARK_ARCHIVES_PATH and SPARK_DIST_CLASSPATH are supposed to point to, and where should I store them in the image?
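
From what I can tell, both variables point at files that ship inside any Spark distribution, so my current guess is the following (SPARK_HOME under /opt/spark and the py4j version, 0.10.7 for Spark 2.4.x, are assumptions based on the stock Apache build):

    # pyspark.zip and the py4j zip sit under python/lib in every Spark distribution
    export PYSPARK_ARCHIVES_PATH=local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip
    # With a "Hadoop free" Spark build, the Spark docs say to put the Hadoop
    # client jars on the classpath like this (requires a hadoop binary in the image):
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Is that the right idea, or are these supposed to point at cluster-side paths?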

What I tried in my Dockerfile (Debian 10):
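
(Simplified sketch; openjdk:8-jdk-slim-buster gives me Debian 10 with Java 8, which Spark 2.4 requires, and the stock Apache 2.4.0 tarball is a stand-in because I cannot download the CDH build.)

    FROM openjdk:8-jdk-slim-buster

    # Python for PySpark, curl for the download, krb5-user for kinit
    RUN apt-get update && \
        apt-get install -y --no-install-recommends python3 curl krb5-user && \
        rm -rf /var/lib/apt/lists/*

    # Stock Apache Spark 2.4.0 as a stand-in for 2.4.0-cdh6.3.4
    RUN curl -fsSL https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz \
        | tar -xz -C /opt && \
        ln -s /opt/spark-2.4.0-bin-hadoop2.7 /opt/spark

    # Client configuration exported from Cloudera Manager, plus Kerberos config
    COPY yarn-conf/ /etc/hadoop/conf/
    COPY krb5.conf /etc/krb5.conf

    ENV SPARK_HOME=/opt/spark \
        HADOOP_CONF_DIR=/etc/hadoop/conf \
        YARN_CONF_DIR=/etc/hadoop/conf \
        PATH=$PATH:/opt/spark/bin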

I only need to use Spark in client mode; I will never run it in local mode, only through YARN on the Cloudera cluster. But should I install all the jars from https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11/2.4.0-cdh6.3.4? If so, how should I do that?
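
So far the closest stand-in I have found is the PyPI package, which bundles the full set of Spark jars under site-packages/pyspark/jars, so nothing has to be pulled jar-by-jar from mvnrepository (whether the stock 2.4.0 jars are wire-compatible with 2.4.0-cdh6.3.4 is exactly what I cannot confirm):

    pip3 install pyspark==2.4.0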

Also, should I open ports when I deploy on K8s? Am I overthinking this? I can find no documentation or tutorial that covers my use case.
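
My current (unverified) understanding of the ports question: in client mode the YARN executors connect back to the driver running in the pod, so the driver has to advertise an address and fixed ports that are reachable from outside K8s, e.g. through a NodePort or LoadBalancer Service. The hostname and port numbers below are placeholders:

    spark-submit --master yarn --deploy-mode client \
        --conf spark.driver.bindAddress=0.0.0.0 \
        --conf spark.driver.host=gateway.example.com \
        --conf spark.driver.port=29413 \
        --conf spark.blockManager.port=29414 \
        app.py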
