In CDP Public Cloud, Data Hub is a service for launching and managing workload clusters. Data Hub clusters strongly resemble traditional (CDH/HDP) clusters. This makes Data Hub clusters an attractive destination for migrating existing workloads to the cloud. In CDP Data Hub, clusters are secure by default. Among other things, this means that perimeter security is provided by Apache Knox Gateway. For remote clients, this means that the available cluster services are only accessible by way of secure endpoints.
In this article, I will walk through how to submit Spark jobs to a Data Hub cluster from a remote client, and I will also provide examples demonstrating how to use the API.
Apache Spark provides a spark-submit script, which is used to launch applications on a cluster. The spark-submit script depends on having network access to the cluster manager (YARN).
A remote client is one that is not deployed on a target Data Hub cluster node, such as a Gateway node. To submit a Spark job, a remote client must use the Apache Livy Service Endpoint. Livy enables interaction with a Spark cluster over a REST interface.
Fortunately, the Livy API call for submitting jobs closely resembles the spark-submit script. The obvious difference is that spark-submit is a command-line tool: a job is defined by command-line parameters (and configured defaults). With Livy, the client submits the job by issuing an HTTP POST request whose JSON body defines the job in almost exactly the same way. Here they are side by side:
spark-submit (CLI):

$ ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --jars a.jar,b.jar \
  --py-files a.py,b.py \
  --files foo.txt,bar.txt \
  --archives foo.zip,bar.zip \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 10G \
  --driver-cores 1 \
  --executor-memory 20G \
  --executor-cores 3 \
  --num-executors 50 \
  --queue default \
  --name test \
  --proxy-user foo \
  --conf spark.jars.packages=xxx \
  /path/to/examples.jar \
  1000

Livy (HTTP POST):

POST /<endpoint>/batches
...
{
  "className": "org.apache.spark.examples.SparkPi",
  "jars": ["a.jar", "b.jar"],
  "pyFiles": ["a.py", "b.py"],
  "files": ["foo.txt", "bar.txt"],
  "archives": ["foo.zip", "bar.zip"],
  "driverMemory": "10G",
  "driverCores": 1,
  "executorMemory": "20G",
  "executorCores": 3,
  "numExecutors": 50,
  "queue": "default",
  "name": "test",
  "proxyUser": "foo",
  "conf": {"spark.jars.packages": "xxx"},
  "file": "s3a:///path/examples.jar",
  "args": [1000]
}
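One practical aside before the cURL walkthrough: once a job definition grows past a handful of fields, it is often easier to keep the JSON body in a file and reference it from the request. The sketch below is illustrative rather than part of the original example; the file name sparkpi.json and the <endpoint> placeholder are assumptions, and the body reuses the SparkPi definition shown above.

# Save the job definition once, then reference it from cURL
cat > sparkpi.json <<'EOF'
{
  "className": "org.apache.spark.examples.SparkPi",
  "file": "s3a:///path/examples.jar",
  "args": [1000]
}
EOF

# Post the saved body instead of pasting it inline; --data-binary sends the file as-is
curl --location --request POST 'https://<endpoint>/cdp-proxy-api/livy/batches' \
  --header 'Authorization: Basic <base64 user:password>' \
  --header 'Content-Type: application/json' \
  --data-binary '@sparkpi.json'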
The Livy API flow is fairly simple: (1) submit the job with a POST to the /batches endpoint, (2) note the batch id returned in the response body (also echoed in the Location header), and (3) poll GET /batches/{id} until the batch reaches a terminal state such as success or dead.
Let’s dive into a real example using cURL. The right authentication mechanism depends on your environment; for this example, basic authentication is used.
Pro tip: (for development only) you can specify the username and password in the URL:
https://chris:opensesame@pse-...a.site/pse-de/cdp-proxy-api/livy/batches
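If you would rather not embed credentials in the URL, cURL's -u flag achieves the same thing (again, suitable for development only; the host below is the same truncated endpoint as above):

curl -u 'chris:opensesame' --location --request GET 'https://pse-...a.site/pse-de/cdp-proxy-api/livy/batches'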
curl -i --location --request POST 'https://pse-de--...--en.ylcu-atmi.cloudera.site/pse-de/cdp-proxy-api/livy/batches' \
--header 'Authorization: Basic Y2h...M6tMTAyOQ==' \
--header 'Content-Type: application/json' \
--data-raw '{
  "file": "s3a://pse-7-env/.../imports/cde-demo/Data_Extraction_Over_150k.py",
  "driverMemory": "2G",
  "driverCores": 1,
  "executorCores": 2,
  "executorMemory": "4G",
  "numExecutors": 3,
  "queue": "default",
  "conf": {
    "kind": "pyspark"
  }
}'
HTTP/1.1 201 Created
Server: nginx
Date: Thu, 12 Aug 2021 15:43:04 GMT
Content-Type: application/json;charset=utf-8
Content-Length: 198
Connection: keep-alive
Set-Cookie: KNOXSESSIONID=nodexxxxxxxxxx2o61kcgrsnn6s4fa2.node0; Path=/pse-de/cdp-proxy-api; Secure; HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/pse-de/cdp-proxy-api; Max-Age=0; Expires=Wed, 11-Aug-2021 15:43:04 GMT; SameSite=lax
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
--------> Location: /batches/240 <----------
{
"id": 240,
"name": null,
"owner": "chris",
"state": "starting",
"appId": null,
"appInfo": {
"driverLogUrl": null,
"sparkUiUrl": null,
"executorLogUrls": null
},
"log": [
"stdout: ",
"\nstderr: ",
"\nYARN Diagnostics: "
]
}
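When scripting against the API, it also helps to capture the new batch id from the JSON response so the status URL used in the next step can be built automatically. The following is a minimal sketch, not part of the original walkthrough: LIVY_URL and LIVY_AUTH are placeholder variables for the Knox Livy endpoint and the user:password pair, the body file sparkpi.json is the hypothetical one from earlier, and python3 is assumed to be available for JSON parsing.

LIVY_URL='https://<endpoint>/cdp-proxy-api/livy'
LIVY_AUTH='user:password'

# Submit the batch and extract the "id" field from the JSON response
BATCH_ID=$(curl -s -u "$LIVY_AUTH" --location --request POST "$LIVY_URL/batches" \
  --header 'Content-Type: application/json' \
  --data-binary '@sparkpi.json' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["id"])')

echo "Submitted batch $BATCH_ID"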
curl --location --request GET 'https://pse-de-master0.pse-7-en.ylcu-atmi.cloudera.site/pse-de/cdp-proxy-api/livy/batches/240' \
--header 'Authorization: Basic Y2hyaX...MTAyOQ==' \
--header 'Cookie: KNOXSESSIONID=node019j8ai8hwv6dp13rbzd9sngkbp3.node0'
{
"id": 241,
"name": null,
"owner": "chris",
"state": "starting",
"appId": null,
"appInfo": {
"driverLogUrl": null,
"sparkUiUrl": null,
"executorLogUrls": null
},
"log": [
"21/08/12 15:57:55 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).",
...
"\nYARN Diagnostics: "
]
}
Repeat Step 3 as necessary:

GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
{
"id": 240,
"name": null,
"owner": "chris",
"state": "starting",
"appId": null,
...
}
GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
{
"id": 240,
"name": null,
"owner": "chris",
"state": "running",
"appId": "application_1628516987381_0021",
"appInfo": {...
}
GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
{
"id": 240,
"name": null,
"owner": "chris",
"state": "success",
"appId": "application_1628516987381_0021",
"appInfo": {...
}
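Rather than repeating the GET by hand, the polling can be wrapped in a small loop that exits once the batch reaches a terminal state. This is again a sketch, under the same assumptions as the earlier snippets (LIVY_URL, LIVY_AUTH, and BATCH_ID are set, and python3 is available); the terminal states checked here are success, dead, and killed.

# Poll the batch every 10 seconds until it finishes
while true; do
  STATE=$(curl -s -u "$LIVY_AUTH" --location "$LIVY_URL/batches/$BATCH_ID" \
    | python3 -c 'import sys, json; print(json.load(sys.stdin)["state"])')
  echo "Batch $BATCH_ID state: $STATE"
  case "$STATE" in
    success|dead|killed) break ;;
  esac
  sleep 10
done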
Check out Tutorial: Using CLI-API to Automate Access to Cloudera Data Engineering if you are interested in learning how to submit Spark jobs to Cloudera Data Engineering (CDE) on CDP using the command-line interface (cde CLI) and RESTful APIs.