Member since: 09-24-2015
Posts: 11
Kudos Received: 12
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2058 | 08-23-2016 01:40 PM
12-08-2017
05:38 AM
8 Kudos
How to Submit Spark Application through Livy REST API

Apache Livy supports using REST APIs to submit Spark applications; it is quite similar to using "spark-submit" in vanilla Spark. In this article we will briefly introduce how to use the Livy REST APIs to submit Spark applications, and how to translate an existing "spark-submit" command into the REST protocol.

Using spark-submit

In vanilla Spark we normally use the "spark-submit" command to submit a Spark application to a cluster. A "spark-submit" command looks like:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the key options are:

- --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi).
- --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077).
- --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client).
- application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
- application-arguments: Arguments passed to the main method of your main class, if any.

To run a Spark application on a cluster manager, we specify "--master" and "--deploy-mode" to choose the cluster manager and the mode to run in. Besides that, we have to tell "spark-submit" the application's entry point, the application jar and its arguments; these are specified through "--class", "<application-jar>" and "[application-arguments]". It is quite easy to specify different arguments with the "spark-submit" command, and "spark-submit --help" shows all the supported arguments.

Then how do we use the Livy REST APIs to submit Spark applications?

Using Livy REST APIs

The Livy REST APIs offer the same ability to submit applications through REST calls. Let's see how to convert the above "spark-submit" command to the REST protocol:

{
  "file": "<application-jar>",
  "className": "<main-class>",
  "args": [args1, args2, ...],
  "conf": {"foo1": "bar1", "foo2": "bar2", ...}
}

This JSON protocol describes the Spark application to submit. To submit the application to the cluster manager, send the JSON above to the Livy server with an HTTP POST request:

curl -H "Content-Type: application/json" -X POST -d '<JSON Protocol>' <livy-host>:<port>/batches

As you can see, most of the arguments are the same, but there are still some differences:

- "master" and "deployMode" cannot be directly specified in the REST API; instead there are two ways to set them:
  - via the Livy configurations "livy.spark.master" and "livy.spark.deploy-mode"; Livy honors these two configurations and applies them at session creation;
  - via Spark configurations, by setting "spark.master" and "spark.submit.deployMode" in the "conf" field of the REST protocol.
- The "application-jar" must be reachable by the remote cluster manager, which means it should be put onto a distributed file system like HDFS. This is different from "spark-submit": "spark-submit" also handles uploading jars from the local disk, but the Livy REST API does not do jar uploading.
- All the other settings, including environment variables, should be configured in the spark-defaults.conf and spark-env.sh files under <SPARK_HOME>/conf.

Examples

Here is an example of translating a "spark-submit" command into the Livy REST JSON protocol.

spark-submit command:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --jars a.jar,b.jar \
  --py-files a.py,b.py \
  --files foo.txt,bar.txt \
  --archives foo.zip,bar.tar \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 10G \
  --driver-cores 1 \
  --executor-memory 20G \
  --executor-cores 3 \
  --num-executors 50 \
  --queue default \
  --name test \
  --proxy-user foo \
  --conf spark.jars.packages=xxx \
  /path/to/examples.jar \
  1000

Livy REST JSON protocol:

{
  "className": "org.apache.spark.examples.SparkPi",
  "jars": ["a.jar", "b.jar"],
  "pyFiles": ["a.py", "b.py"],
  "files": ["foo.txt", "bar.txt"],
  "archives": ["foo.zip", "bar.tar"],
  "driverMemory": "10G",
  "driverCores": 1,
  "executorCores": 3,
  "executorMemory": "20G",
  "numExecutors": 50,
  "queue": "default",
  "name": "test",
  "proxyUser": "foo",
  "conf": {"spark.jars.packages": "xxx"},
  "file": "hdfs:///path/to/examples.jar",
  "args": [1000]
}
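As a quick end-to-end illustration, here is a hedged sketch that submits the SparkPi batch above and then polls its state. It assumes Livy is listening on localhost:8998 (its default port), that examples.jar is already on HDFS, and that the returned batch id is 0.

```bash
# Submit the SparkPi batch (assumes Livy on localhost:8998 and the jar already on HDFS).
curl -H "Content-Type: application/json" -X POST \
  -d '{
        "className": "org.apache.spark.examples.SparkPi",
        "file": "hdfs:///path/to/examples.jar",
        "args": ["1000"],
        "conf": {"spark.master": "yarn", "spark.submit.deployMode": "cluster"}
      }' \
  http://localhost:8998/batches

# The response contains the batch id (assumed to be 0 below).
# Use it to poll the batch state and to fetch the driver log.
curl http://localhost:8998/batches/0/state
curl http://localhost:8998/batches/0/log
```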
08-24-2016
01:49 PM
SPARK_MAJOR_VERSION is an environment variable; you can set it in your .bashrc or anywhere else, like a normal environment variable. By default SPARK_MAJOR_VERSION=1, which means Spark 1.6.2 is picked by default. If you want to choose Spark 2, set SPARK_MAJOR_VERSION=2 before you run spark-shell.
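For example, a minimal sketch of switching to Spark 2 for one shell session (the exact versions picked depend on what is installed on your HDP cluster):

```bash
# Use Spark 2 for this session only; unset it (or set it to 1) to go back to Spark 1.6.x.
export SPARK_MAJOR_VERSION=2
spark-shell            # the startup banner shows which Spark version was picked

# Or make the choice permanent for your user:
echo 'export SPARK_MAJOR_VERSION=2' >> ~/.bashrc
```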
08-24-2016
01:45 PM
AFAIK Spark 2.0 will be shipped as a TP (Technical Preview) in HDP 2.5, so you don't need to manually install it from the Apache community.
08-23-2016
01:40 PM
1 Kudo
1. This can be controlled through configuration, please see http://spark.apache.org/docs/latest/configuration.html#memory-management (a short sketch follows after this list).
2. No, you cannot disable non-memory caching, but you can choose a MEMORY-only storage level to avoid spilling to disk when memory is full.
3. No, the data is not encrypted, and there is currently no way to encrypt spilled data.
4. It depends on which streaming source you choose. For Kafka, SSL or SASL encryption is supported.
5. Same as #2.
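For point 1, a minimal sketch of the unified memory-management knobs described on that page (the property names are from the Spark configuration docs; the values and the application shown are illustrative only):

```bash
# Illustrative values only -- tune them for your own workload.
# spark.memory.fraction: share of the heap used for execution + storage.
# spark.memory.storageFraction: part of that region protected from eviction.
./bin/spark-submit \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --class org.apache.spark.examples.SparkPi \
  /path/to/examples.jar 100
```

For point 2, the storage level is chosen in your application code when you call persist(), e.g. by picking a MEMORY_ONLY level instead of MEMORY_AND_DISK.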
07-22-2016
12:58 AM
1. If initialExecutors = 5 as you set, the initial number of executors should be 5 as monitored.
2. Another concern: if you want to bring the executor count up to the maximum, you need to submit jobs continuously to increase the load on the scheduler. In your case the SparkPi application only has one job, so the scheduler load drops once that job is submitted, and Dynamic Allocation will schedule the newly expected number of resources.
3. If you're using the latest version of HDP, the thriftserver is already installed with dynamic allocation enabled by default. You can use beeline to submit SQL queries (see the sketch below).
4. The logic of the code is correct; I think what you need to do is try different scenarios to trigger that logic.
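For point 3, a minimal beeline sketch (host, port, user and query are placeholders; 10015 is a common HDP port for the Spark thriftserver, but check your own configuration):

```bash
# Connect to the Spark thriftserver and run a query large enough to keep the scheduler busy.
beeline -u "jdbc:hive2://localhost:10015/default" -n hive \
  -e "SELECT count(*) FROM some_large_table"
```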
07-21-2016
07:08 AM
1 Kudo
I think SparkPi cannot effectively validate the functionality of dynamic allocation, basically because it runs so fast that it gives little time for dynamic allocation to bring up more executors. Dynamic allocation needs to detect the current load (number of tasks), calculate the expected number of executors, and then issue requests to the AM/RM/NM through RPC; it usually takes several seconds to go through this pipeline. But I guess the whole SparkPi application will only run for several seconds, so it is too fast for dynamic allocation to fully request the expected resources. If you want to verify the functionality of dynamic allocation, the Spark thriftserver is a good candidate.
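If you do want to exercise dynamic allocation with a longer-running workload, here is a hedged sketch of the usual settings (the property names are standard Spark dynamic-allocation configurations; the values, class and jar are placeholders, and the YARN shuffle service must already be set up on the NodeManagers):

```bash
# Placeholder class and jar -- substitute a longer-running application of your own.
./bin/spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=5 \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class com.example.LongRunningApp \
  /path/to/your-app.jar
```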
03-30-2016
12:50 AM
Actually the way here of setting it through Configuration is the same as what I described above using SparkConf: SparkConf will pick out all the configurations starting with "spark.hadoop", remove the prefix and set them into the Hadoop Configuration.
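For illustration, the same prefix-stripping behaviour can be seen from the command line; the key, value, class and jar below are only an example (a printable delimiter is used instead of \u0003 to keep the shell quoting simple):

```bash
# Any "spark.hadoop.*" property has its prefix removed and is copied into the
# Hadoop Configuration that Spark uses, so this arrives on the Hadoop side as
# textinputformat.record.delimiter.
./bin/spark-submit \
  --conf "spark.hadoop.textinputformat.record.delimiter=|" \
  --class com.example.DelimiterApp \
  /path/to/app.jar
```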
03-29-2016
01:51 AM
2 Kudos
Can you please take a try with:
conf.set("spark.hadoop.textinputformat.record.delimiter", "\\u0003");
Also, you don't need to write this:
Configuration config = new Configuration();
config.set("textinputformat.record.delimiter", "\\u0003");
Actually none of the Spark code will use this Configuration.
12-24-2015
09:34 AM
1 Kudo
Basically YARN and HDFS are required if you want to run the Spark shell on YARN; for the others it depends on your workload.
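For instance, a minimal sketch of bringing the shell up on YARN once YARN and HDFS are running (the config path is the usual HDP location and may differ on your cluster):

```bash
# HADOOP_CONF_DIR must point at the cluster's config files so spark-shell can
# find the ResourceManager and HDFS.
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/spark-shell --master yarn --deploy-mode client
```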