In CDP Public Cloud, Data Hub is a service for launching and managing workload clusters. Data Hub clusters strongly resemble traditional (CDH/HDP) clusters. This makes Data Hub clusters an attractive destination for migrating existing workloads to the cloud. In CDP Data Hub, clusters are secure by default. Among other things, this means that perimeter security is provided by Apache Knox Gateway. For remote clients, this means that the available cluster services are only accessible by way of secure endpoints.
In this article, I will walk through how to submit Spark jobs to a Data Hub cluster from a remote client, and I will also provide examples demonstrating how to use the API.
Apache Spark provides a spark-submit script, which is used to launch applications on a cluster. The spark-submit script depends on having network access to the cluster manager (YARN).
A remote client is one that is not deployed on a target Data Hub cluster node, such as a Gateway node. To submit a Spark job, a remote client must use the Apache Livy Service Endpoint. Livy enables interaction with a Spark cluster over a REST interface.
Fortunately, the Livy API call for submitting jobs closely resembles the spark-submit script. The obvious difference is that spark-submit is a command-line tool: a job is defined by command-line parameters (and configured defaults). With Livy, the client submits the job by issuing an HTTP POST request whose JSON body defines the job in almost exactly the same way. Here they are side by side:
spark-submit (CLI):

$ ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --jars a.jar,b.jar \
  --py-files a.py,b.py \
  --files foo.txt,bar.txt \
  --archives foo.zip,bar.zip \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 10G \
  --driver-cores 1 \
  --executor-memory 20G \
  --executor-cores 3 \
  --num-executors 50 \
  --queue default \
  --name test \
  --proxy-user foo \
  --conf spark.jars.packages=xxx \
  /path/to/examples.jar \
  1000

Livy (HTTP POST):

POST /<endpoint>/batches
...
{
  "className": "org.apache.spark.examples.SparkPi",
  "jars": ["a.jar", "b.jar"],
  "pyFiles": ["a.py", "b.py"],
  "files": ["foo.txt", "bar.txt"],
  "archives": ["foo.zip", "bar.zip"],
  "driverMemory": "10G",
  "driverCores": 1,
  "executorMemory": "20G",
  "executorCores": 3,
  "numExecutors": 50,
  "queue": "default",
  "name": "test",
  "proxyUser": "foo",
  "conf": {"spark.jars.packages": "xxx"},
  "file": "s3a:///path/examples.jar",
  "args": [1000]
}
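One practical aside before the cURL walkthrough: once a job definition grows past a handful of fields, it is often easier to keep the JSON body in a file and reference it from the request. The sketch below is illustrative rather than part of the original example; the file name sparkpi.json and the <endpoint> placeholder are assumptions, and the body reuses the SparkPi definition shown above.

# Save the job definition once, then reference it from cURL
cat > sparkpi.json <<'EOF'
{
  "className": "org.apache.spark.examples.SparkPi",
  "file": "s3a:///path/examples.jar",
  "args": [1000]
}
EOF

# Post the saved body instead of pasting it inline; --data-binary sends the file as-is
curl --location --request POST 'https://<endpoint>/cdp-proxy-api/livy/batches' \
  --header 'Authorization: Basic <base64 user:password>' \
  --header 'Content-Type: application/json' \
  --data-binary '@sparkpi.json'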
The Livy API flow is fairly simple: (1) submit the job with a POST to the /batches endpoint, (2) note the batch id returned in the response body (also echoed in the Location header), and (3) poll GET /batches/{id} until the batch reaches a terminal state such as success or dead.
Let’s dive into a real example using cURL. The right authentication mechanism depends on your environment; for this example, basic authentication is used.
Pro tip: (for development only) you can specify the username and password in the URL:
https://chris:opensesame@pse-...a.site/pse-de/cdp-proxy-api/livy/batches
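If you would rather not embed credentials in the URL, cURL's -u flag achieves the same thing (again, suitable for development only; the host below is the same truncated endpoint as above):

curl -u 'chris:opensesame' --location --request GET 'https://pse-...a.site/pse-de/cdp-proxy-api/livy/batches'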
curl -i --location --request POST 'https://pse-de--...--en.ylcu-atmi.cloudera.site/pse-de/cdp-proxy-api/livy/batches' \
--header 'Authorization: Basic Y2h...M6tMTAyOQ==' \
--header 'Content-Type: application/json' \
--data-raw '{
  "file": "s3a://pse-7-env/.../imports/cde-demo/Data_Extraction_Over_150k.py",
  "driverMemory": "2G",
  "driverCores": 1,
  "executorCores": 2,
  "executorMemory": "4G",
  "numExecutors": 3,
  "queue": "default",
  "conf": {
    "kind": "pyspark"
  }
}'
HTTP/1.1 201 Created
Server: nginx
Date: Thu, 12 Aug 2021 15:43:04 GMT
Content-Type: application/json;charset=utf-8
Content-Length: 198
Connection: keep-alive
Set-Cookie: KNOXSESSIONID=nodexxxxxxxxxx2o61kcgrsnn6s4fa2.node0; Path=/pse-de/cdp-proxy-api; Secure; HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/pse-de/cdp-proxy-api; Max-Age=0; Expires=Wed, 11-Aug-2021 15:43:04 GMT; SameSite=lax
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
--------> Location: /batches/240 <----------
{
"id": 240,
"name": null,
"owner": "chris",
"state": "starting",
"appId": null,
"appInfo": {
"driverLogUrl": null,
"sparkUiUrl": null,
"executorLogUrls": null
},
"log": [
"stdout: ",
"\nstderr: ",
"\nYARN Diagnostics: "
]
}
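When scripting against the API, it also helps to capture the new batch id from the JSON response so the status URL used in the next step can be built automatically. The following is a minimal sketch, not part of the original walkthrough: LIVY_URL and LIVY_AUTH are placeholder variables for the Knox Livy endpoint and the user:password pair, the body file sparkpi.json is the hypothetical one from earlier, and python3 is assumed to be available for JSON parsing.

LIVY_URL='https://<endpoint>/cdp-proxy-api/livy'
LIVY_AUTH='user:password'

# Submit the batch and extract the "id" field from the JSON response
BATCH_ID=$(curl -s -u "$LIVY_AUTH" --location --request POST "$LIVY_URL/batches" \
  --header 'Content-Type: application/json' \
  --data-binary '@sparkpi.json' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["id"])')

echo "Submitted batch $BATCH_ID"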
curl --location --request GET 'https://pse-de-master0.pse-7-en.ylcu-atmi.cloudera.site/pse-de/cdp-proxy-api/livy/batches/240' \
--header 'Authorization: Basic Y2hyaX...MTAyOQ==' \
--header 'Cookie: KNOXSESSIONID=node019j8ai8hwv6dp13rbzd9sngkbp3.node0'
{
"id": 241,
"name": null,
"owner": "chris",
"state": "starting",
"appId": null,
"appInfo": {
"driverLogUrl": null,
"sparkUiUrl": null,
"executorLogUrls": null
},
"log": [
"21/08/12 15:57:55 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).",
...
"\nYARN Diagnostics: "
]
}
Repeat Step 3 as necessary:

GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
{
"id": 240,
"name": null,
"owner": "chris",
"state": "starting",
"appId": null,
...
}
GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
{
"id": 240,
"name": null,
"owner": "chris",
"state": "running",
"appId": "application_1628516987381_0021",
"appInfo": {...
}
GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
{
"id": 240,
"name": null,
"owner": "chris",
"state": "success",
"appId": "application_1628516987381_0021",
"appInfo": {...
}
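Rather than repeating the GET by hand, the polling can be wrapped in a small loop that exits once the batch reaches a terminal state. This is again a sketch, under the same assumptions as the earlier snippets (LIVY_URL, LIVY_AUTH, and BATCH_ID are set, and python3 is available); the terminal states checked here are success, dead, and killed.

# Poll the batch every 10 seconds until it finishes
while true; do
  STATE=$(curl -s -u "$LIVY_AUTH" --location "$LIVY_URL/batches/$BATCH_ID" \
    | python3 -c 'import sys, json; print(json.load(sys.stdin)["state"])')
  echo "Batch $BATCH_ID state: $STATE"
  case "$STATE" in
    success|dead|killed) break ;;
  esac
  sleep 10
done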
Check out Tutorial: Using CLI-API to Automate Access to Cloudera Data Engineering if you are interested in learning how to submit Spark jobs to Cloudera Data Engineering (CDE) on CDP using the command-line interface (cde CLI) and RESTful APIs.