01-29-2019 09:13 AM
Note: The process below is easier if the node is a gateway node, since the correct Spark version and configuration directories will already be present and ready to mount into the Docker container.

The quick and dirty way is to have a Spark installation matching your cluster's major version installed in, or mounted into, the Docker container. You will also need to mount the YARN and Hadoop configuration directories into the container (a concrete docker run sketch is at the end of this post). Mounting these saves you from setting a ton of config on submission, e.g. "spark.hadoop.yarn.resourcemanager.hostname" -> "XXX". Often both can point at the same directory: /opt/cloudera/parcels/SPARK2/lib/spark2/conf/yarn-conf.

The SPARK_CONF_DIR, HADOOP_CONF_DIR and YARN_CONF_DIR environment variables need to be set if you are using spark-submit. If you are using SparkLauncher, they can be passed in like so:

    import scala.collection.JavaConverters._
    import org.apache.spark.launcher.SparkLauncher

    val env = Map(
      "HADOOP_CONF_DIR" -> "/example/hadoop/path",
      "YARN_CONF_DIR" -> "/example/yarn/path"
    )
    val launcher = new SparkLauncher(env.asJava)
      .setSparkHome("/path/to/mounted/spark")

If submitting to a kerberized cluster, the easiest way is to mount a keytab file and the /etc/krb5.conf file in the Docker container, then set the principal and keytab using spark.yarn.principal and spark.yarn.keytab, respectively.

For ports, port 8032 on the Spark master (the YARN ResourceManager's external port) definitely needs to be open to traffic from the Docker node. I am not sure if this is the complete list of ports - could another user verify?
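To make the mounting concrete, here is a minimal docker run sketch of the setup described above. The image name (my-spark-client), the container-side paths, and the keytab location are placeholders I chose for illustration; the host-side paths assume a CDH gateway node with the SPARK2 parcel installed:

    # Mount the Spark install, the yarn-conf directory, krb5.conf, and a
    # keytab read-only into the container, and point the env vars at them.
    docker run -it \
      -v /opt/cloudera/parcels/SPARK2/lib/spark2:/opt/spark:ro \
      -v /opt/cloudera/parcels/SPARK2/lib/spark2/conf/yarn-conf:/opt/yarn-conf:ro \
      -v /etc/krb5.conf:/etc/krb5.conf:ro \
      -v /path/to/user.keytab:/opt/keytabs/user.keytab:ro \
      -e SPARK_HOME=/opt/spark \
      -e SPARK_CONF_DIR=/opt/yarn-conf \
      -e HADOOP_CONF_DIR=/opt/yarn-conf \
      -e YARN_CONF_DIR=/opt/yarn-conf \
      my-spark-client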
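And for the kerberized case, a SparkLauncher sketch that ties the pieces together might look like the following. The paths match the placeholder mounts in the docker run sketch above, and the master/deploy mode, application jar, main class, and principal are all assumptions for illustration, not values from my setup:

    import scala.collection.JavaConverters._
    import org.apache.spark.launcher.SparkLauncher

    // Environment for the forked spark-submit process; these directories
    // come from the volume mounts in the docker run sketch above.
    val env = Map(
      "HADOOP_CONF_DIR" -> "/opt/yarn-conf",
      "YARN_CONF_DIR"   -> "/opt/yarn-conf"
    )

    val launcher = new SparkLauncher(env.asJava)
      .setSparkHome("/opt/spark")
      .setMaster("yarn")
      .setDeployMode("cluster")
      // Kerberos identity; principal and keytab path are placeholders.
      .setConf("spark.yarn.principal", "user@EXAMPLE.COM")
      .setConf("spark.yarn.keytab", "/opt/keytabs/user.keytab")
      .setAppResource("/opt/app/my-app.jar") // placeholder application jar
      .setMainClass("com.example.MyApp")     // placeholder main class

    // startApplication() returns a handle that can be polled for app state.
    val handle = launcher.startApplication()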