Created on 12-08-2017 05:38 AM
How to Submit Spark Application through Livy REST API
Apache Livy supports submitting Spark applications through REST APIs, much like "spark-submit" in vanilla Spark. This article briefly introduces how to submit Spark applications with the Livy REST APIs and how to translate an existing "spark-submit" command into a REST call.
Using spark-submit
In vanilla Spark, applications are normally submitted to a cluster with the "spark-submit" command, which looks like:

```
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```
Some of the key options are:
- --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi).
- --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077).
- --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client).
- application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
- application-arguments: Arguments passed to the main method of your main class, if any.
To run a Spark application on a cluster manager, we specify "--master" and "--deploy-mode" to choose the cluster manager and the mode in which the application runs. Besides that, we need to tell "spark-submit" the application's entry point, the application jar, and its arguments; these are specified through "--class", "<application-jar>" and "[application-arguments]".
"spark-submit" makes it easy to specify these different options, and "spark-submit --help" lists all the supported arguments.
Then how do we use the Livy REST APIs to submit Spark applications?
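As a concrete illustration, the template above can also be assembled programmatically before handing it to a shell or a process launcher. The class name, paths, and values below are placeholders, not taken from any real deployment:

```python
import shlex

# A filled-in spark-submit invocation built as an argument list
# (placeholder paths and values); shlex.join renders it as a
# copy-pasteable shell command line.
cmd = [
    "./bin/spark-submit",
    "--class", "org.apache.spark.examples.SparkPi",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--conf", "spark.executor.memory=2g",
    "/path/to/examples.jar",
    "1000",
]
print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting mistakes when arguments contain spaces or special characters.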
Using Livy REST APIs
The Livy REST APIs offer an equivalent way to submit applications. Let's see how to convert the above "spark-submit" command into the REST protocol.
```
{
  "file": "<application-jar>",
  "className": "<main-class>",
  "args": [args1, args2, ...],
  "conf": {"foo1": "bar1", "foo2": "bar2", ...}
}
```
This JSON protocol submits a Spark application to the cluster manager. To submit it, send the JSON to the Livy server with an HTTP POST request:

```
curl -H "Content-Type: application/json" -X POST -d '<JSON Protocol>' <livy-host>:<port>/batches
```
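The same POST can be issued from Python's standard library. This is a minimal sketch: the Livy host/port and all payload values are placeholders, and the payload mirrors the JSON protocol above:

```python
import json
import urllib.request

# Batch payload following the Livy JSON protocol; the jar path,
# class name, and args are placeholders for your own application.
payload = {
    "file": "hdfs:///path/to/examples.jar",
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["1000"],
    "conf": {"spark.master": "yarn", "spark.submit.deployMode": "cluster"},
}

def submit_batch(livy_url, body):
    """POST the JSON protocol to /batches and return the parsed response."""
    req = urllib.request.Request(
        livy_url + "/batches",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example call against a hypothetical local Livy server:
# submit_batch("http://localhost:8998", payload)
```

The response from Livy includes a batch id, which can be used to query the batch's state later.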
As you can see, most of the arguments are the same, but there are still some differences:
- "master" and "deployMode" cannot be specified directly in the REST API; instead, there are two ways to set them:
- Via the Livy configurations "livy.spark.master" and "livy.spark.deploy-mode". Livy honors these two configurations and applies them at session creation.
- Via Spark configurations, by setting "spark.master" and "spark.submit.deployMode" in the "conf" field of the REST protocol.
- The "application-jar" must be reachable by the remote cluster manager, which means it should be put on a distributed file system like HDFS. This differs from "spark-submit": "spark-submit" also handles uploading jars from the local disk, but the Livy REST APIs do not upload jars.
All other settings, including environment variables, should be configured in the spark-defaults.conf and spark-env.sh files under <SPARK_HOME>/conf.
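For reference, spark-defaults.conf uses a simple whitespace-separated key/value format. A small helper like the following sketch (my own illustration, not part of Livy) can turn such a file into the "conf" dictionary of the JSON protocol:

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style lines ("key   value") into a dict
    usable as the "conf" field of the Livy JSON protocol.

    Blank lines and '#' comments are skipped, matching the conf format.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf

sample = """
# example spark-defaults.conf
spark.master                 yarn
spark.submit.deployMode      cluster
"""
print(parse_spark_defaults(sample))
```

This is convenient when migrating an existing Spark installation's defaults into per-batch "conf" settings.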
Examples
Here is an example of translating a "spark-submit" command into the Livy REST JSON protocol.
spark-submit command:

```
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --jars a.jar,b.jar \
  --py-files a.py,b.py \
  --files foo.txt,bar.txt \
  --archives foo.zip,bar.tar \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 10G \
  --driver-cores 1 \
  --executor-memory 20G \
  --executor-cores 3 \
  --num-executors 50 \
  --queue default \
  --name test \
  --proxy-user foo \
  --conf spark.jars.packages=xxx \
  /path/to/examples.jar \
  1000
```

Livy REST JSON protocol:

```
{
  "className": "org.apache.spark.examples.SparkPi",
  "jars": ["a.jar", "b.jar"],
  "pyFiles": ["a.py", "b.py"],
  "files": ["foo.txt", "bar.txt"],
  "archives": ["foo.zip", "bar.tar"],
  "driverMemory": "10G",
  "driverCores": 1,
  "executorMemory": "20G",
  "executorCores": 3,
  "numExecutors": 50,
  "queue": "default",
  "name": "test",
  "proxyUser": "foo",
  "conf": {"spark.jars.packages": "xxx"},
  "file": "hdfs:///path/to/examples.jar",
  "args": ["1000"]
}
```
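The flag-to-field mapping in the example above can be sketched as a small conversion helper. The field names come from the example; the scalar/list/integer handling is my own assumption for illustration, and "--master" and "--deploy-mode" are deliberately left out since, as noted earlier, they are not part of the batch protocol:

```python
# Mapping from spark-submit flags to Livy batch-protocol fields,
# taken from the example above.
FLAG_MAP = {
    "--class": "className",
    "--jars": "jars",
    "--py-files": "pyFiles",
    "--files": "files",
    "--archives": "archives",
    "--driver-memory": "driverMemory",
    "--driver-cores": "driverCores",
    "--executor-memory": "executorMemory",
    "--executor-cores": "executorCores",
    "--num-executors": "numExecutors",
    "--queue": "queue",
    "--name": "name",
    "--proxy-user": "proxyUser",
}
LIST_FLAGS = {"--jars", "--py-files", "--files", "--archives"}  # comma-separated
INT_FLAGS = {"--driver-cores", "--executor-cores", "--num-executors"}

def to_livy_json(argv):
    """Translate a spark-submit argument list into a Livy batch body."""
    body, i = {}, 0
    while i < len(argv):
        arg = argv[i]
        if arg in FLAG_MAP:
            value = argv[i + 1]
            if arg in LIST_FLAGS:
                value = value.split(",")
            elif arg in INT_FLAGS:
                value = int(value)
            body[FLAG_MAP[arg]] = value
            i += 2
        elif arg == "--conf":
            key, _, val = argv[i + 1].partition("=")
            body.setdefault("conf", {})[key] = val
            i += 2
        else:
            # First non-flag token is the application jar; the rest are args.
            body["file"] = arg
            body["args"] = argv[i + 1:]
            break
    return body
```

Remember that the resulting "file" value must still point at a location the cluster can reach, such as an hdfs:// path.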
Created on 12-09-2017 06:21 PM
This is great thanks! Also, before too long there should be a NiFi processor and controller service to help with some of the session management (NIFI-4683).
Created on 06-03-2019 05:17 PM
Hi, I have an exception, java.io.FileNotFoundException:
```
2019-06-01 00:43:19,160 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.FileNotFoundException: File hdfs://localhost:9000/home/spark-2.4.3-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.3.jar does not exist.
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
Created on 07-02-2019 12:00 AM
@sshao I am submitting a Spark job through Livy; below are my conf parameters:

```
data = {'kind': 'pyspark', 'driverMemory': '2G', 'driverCores': 2, 'numExecutors': 1, 'executorMemory': '1G', 'executorCores': 1, 'conf': {'spark.yarn.appMasterEnv.PYSPARK_PYTHON': '/usr/bin/python3'}}
```

> livy.spark.master : yarn-cluster
In the explanation above you mentioned the "livy.spark.deploy-mode" property; I guess it is "livy.spark.deployMode", correct me if I am wrong.
In addition to the above, what other config should I change?
Created on 08-27-2021 10:57 AM
How does Livy know which shell to use? Say I use kind: spark or kind: pyspark, does it connect using the Scala or the PySpark shell to start a session?