Submit spark job from outside cluster
Created on ‎08-24-2018 12:07 AM - edited ‎09-16-2022 06:37 AM
How can I submit a Spark job from outside the cluster to a Cloudera cluster in YARN mode? Is there any client available to submit the Spark job?
How do I install the Cloudera client to access the Cloudera cluster from an edge node (outside of the cluster)? Any help and suggestions would be greatly appreciated.
Created ‎08-24-2018 12:53 AM
Hi, if you are using a Cloudera Manager deployed cluster with parcels, add a new host to the list of hosts and then deploy the YARN and SPARK GATEWAY roles on this node. This will trigger Cloudera Manager to distribute the parcels to this edge node and "activate" it.
After that, the following commands should be on your PATH: spark-submit and spark-shell (or spark2-submit and spark2-shell if you deployed SPARK2_ON_YARN).
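A quick sanity check that the gateway roles took effect (optional):
which spark-submit
spark-submit --version   # should print the Spark version shipped with the parcel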
If you are using Kerberos, make sure you have the client libraries and a valid krb5.conf file, and make sure you have a valid ticket in your cache.
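For example, obtaining a ticket from a keytab (the principal and keytab path below are placeholders for your own):
kinit -kt /path/to/your.keytab your_user@YOUR.REALM
klist   # verify the ticket is now in the cache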
Then to submit a Spark job to YARN:
spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
or
spark-submit --class path.to.your.Class --master yarn --deploy-mode client [options] <app jar> [app options]
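As a concrete sketch (the class name, jar path, and queue are hypothetical):
spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster \
  --queue default --num-executors 4 \
  /home/user/myapp.jar arg1 arg2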
Created ‎08-24-2018 04:57 AM
Maybe I missed this in my earlier question: I am trying to submit the job from a Docker container run by a JupyterHub service. How do I do that?
Created on ‎08-28-2018 03:16 AM - edited ‎08-28-2018 03:17 AM
Has anyone tried the above scenario, "submit the job from a Docker container run by a JupyterHub service to a CDH cluster"? How do I do that? @Tomas79
Created ‎01-28-2019 08:14 AM
Hey, I'm interested in knowing whether you were able to achieve this, and how. I'm doing something similar.
Created on ‎01-29-2019 09:13 AM - edited ‎01-29-2019 09:16 AM
Note: The process below is easier if the node is a gateway node, since the correct Spark version and configuration directories will be readily available for mounting into the Docker container.
The quick and dirty way is to have a Spark installation that matches your cluster's major version installed or mounted in the Docker container.
You will also need to mount the YARN and Hadoop configuration directories in the Docker container. Mounting these saves you from having to set a ton of configuration at submission time, e.g.:
"spark.hadoop.yarn.resourcemanager.hostname","XXX"
Often both of these can point to the same directory: /opt/cloudera/parcels/SPARK2/lib/spark2/conf/yarn-conf.
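A minimal sketch of the mounts, assuming the parcel paths above; the image name (jupyter-spark) and container-side paths are hypothetical:
docker run -it \
  -v /opt/cloudera/parcels/SPARK2/lib/spark2:/opt/spark:ro \
  -v /opt/cloudera/parcels/SPARK2/lib/spark2/conf/yarn-conf:/etc/hadoop/conf:ro \
  -e SPARK_HOME=/opt/spark \
  -e HADOOP_CONF_DIR=/etc/hadoop/conf \
  -e YARN_CONF_DIR=/etc/hadoop/conf \
  jupyter-spark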
The SPARK_CONF_DIR, HADOOP_CONF_DIR, and YARN_CONF_DIR environment variables need to be set if using spark-submit. If using SparkLauncher, they can be set like so:
import org.apache.spark.launcher.SparkLauncher
import scala.collection.JavaConverters._  // for .asJava

// Point the launcher at the mounted Hadoop/YARN configuration directories
val env = Map(
  "HADOOP_CONF_DIR" -> "/example/hadoop/path",
  "YARN_CONF_DIR" -> "/example/yarn/path"
)
val launcher = new SparkLauncher(env.asJava).setSparkHome("/path/to/mounted/spark")
If submitting to a Kerberized cluster, the easiest way is to mount a keytab file and the /etc/krb5.conf file in the Docker container. Set the principal and keytab using spark.yarn.principal and spark.yarn.keytab, respectively.
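For example, with spark-submit (the principal, keytab path, class, and jar are placeholders):
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.principal=your_user@YOUR.REALM \
  --conf spark.yarn.keytab=/path/to/your.keytab \
  --class com.example.MyApp /home/user/myapp.jar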
For ports, port 8032 on the YARN ResourceManager (what Spark treats as the master in YARN mode) definitely needs to be open to traffic from the Docker node. I am not sure whether this is the complete list of ports - could another user verify?
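A quick way to confirm the container can reach it (the ResourceManager hostname is a placeholder):
nc -vz resourcemanager.example.com 8032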
