In CDP Public Cloud, Data Hub is a service for launching and managing workload clusters. Data Hub clusters strongly resemble traditional (CDH/HDP) clusters. This makes Data Hub clusters an attractive destination for migrating existing workloads to the cloud. In CDP Data Hub, clusters are secure by default.  Among other things, this means that perimeter security is provided by Apache Knox Gateway. For remote clients, this means that the available cluster services are only accessible by way of secure endpoints. 

[Figure: Data Hub cluster service endpoints]

In this article, I will walk through how to submit Spark jobs to a Data Hub cluster from a remote client, and I will also provide examples demonstrating how to use the API.

Apache Spark provides a spark-submit script, which is used to launch applications on a cluster. The spark-submit script depends on having network access to the cluster manager (YARN).

A remote client is one that is not deployed on a target Data Hub cluster node, such as a Gateway node. To submit a Spark job, a remote client must use the Apache Livy Service Endpoint. Livy enables interaction with a Spark cluster over a REST interface.

 


Fortunately, the Livy API call for submitting jobs closely resembles the spark-submit script. The obvious difference is that spark-submit is a command-line tool: the job is defined by command-line parameters (and configured defaults). With Livy, the client submits the job by issuing an HTTP POST request whose JSON body defines the job in almost exactly the same way. Here they are side by side:

$ ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --jars a.jar,b.jar \
  --py-files a.py,b.py \
  --files foo.txt,bar.txt \
  --archives foo.zip,bar.zip \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 10G \
  --driver-cores 1 \
  --executor-memory 20G \
  --executor-cores 3 \
  --num-executors 50 \
  --queue default \
  --name test \
  --proxy-user foo \
  --conf spark.jars.packages=xxx \
  /path/to/examples.jar \
  1000

POST /<endpoint>/batches ...
{
    "className": "org.apache.spark.examples.SparkPi",
    "jars": ["a.jar", "b.jar"],
    "pyFiles": ["a.py", "b.py"],
    "files": ["foo.txt", "bar.txt"],
    "archives": ["foo.zip", "bar.zip"],
    "driverMemory": "10G",
    "driverCores": 1,
    "executorMemory": "20G",
    "executorCores": 3,
    "numExecutors": 50,
    "queue": "default",
    "name": "test",
    "proxyUser": "foo",
    "conf": {"spark.jars.packages": "xxx"},
    "file": "s3a:///path/examples.jar",
    "args": [1000]
}


The Livy API flow is fairly simple (a minimal sketch follows this list):

  • Client: issues an HTTP POST request to submit the job
  • Livy: submits the Spark job to YARN and obtains a job ID
  • Livy: returns a “Location” header (containing the job ID) in the HTTP response
  • Client (optional): polls for job status by issuing HTTP GET requests against the “Location” received from the job submission and reads the application state from the response body
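
Putting those steps together, here is a minimal bash sketch of the flow. The endpoint, credentials, and job file are placeholders for illustration only; it polls Livy's GET /batches/{id}/state call, which returns just the batch id and state, until the job reaches a terminal state.

    # Minimal sketch; <datahub-host>, <cluster-name>, the credentials, and the job
    # file are placeholders -- substitute values for your own environment.
    LIVY='https://<datahub-host>/<cluster-name>/cdp-proxy-api/livy'
    LIVY_USER='chris'; LIVY_PASS='opensesame'

    # 1. Submit the batch and capture the "Location" response header (e.g. /batches/240)
    LOCATION=$(curl -s -i -u "$LIVY_USER:$LIVY_PASS" \
      --header 'Content-Type: application/json' \
      --data-raw '{"file": "s3a://<bucket>/path/to/job.py", "conf": {"kind": "pyspark"}}' \
      "$LIVY/batches" \
      | awk -F': ' 'tolower($1) == "location" {print $2}' | tr -d '\r')

    # 2. Poll until the batch reaches a terminal state
    while true; do
      STATE=$(curl -s -u "$LIVY_USER:$LIVY_PASS" "$LIVY$LOCATION/state")
      echo "$STATE"
      case "$STATE" in
        *success*|*dead*|*killed*|*error*) break ;;
      esac
      sleep 10
    done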

Let’s dive into a real example using cURL. Your choice for authentication will depend on your own environment, but for this example, basic authentication is used. 

Pro tip: (for development only) you can specify the username and password in the URL:

https://chris:opensesame@pse-...a.site/pse-de/cdp-proxy-api/livy/batches
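
Alternatively, curl's -u option passes the same credentials without embedding them in the URL (still development only; the walkthrough below uses a pre-encoded Basic Authorization header instead). For example, listing the active batch sessions is a quick way to verify connectivity and credentials; the host below is a placeholder:

    curl -u chris:opensesame --location --request GET \
      'https://<datahub-host>/<cluster-name>/cdp-proxy-api/livy/batches'
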
  1. POST to the Livy Endpoint + '/batches'
    curl -i --location --request POST 'https://pse-de--...--en.ylcu-atmi.cloudera.site/pse-de/cdp-proxy-api/livy/batches' \
    --header 'Authorization: Basic Y2h...M6tMTAyOQ==' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "file": "s3a://pse-7-env/.../imports/cde-demo/Data_Extraction_Over_150k.py",
        "driverMemory": "2G",
        "driverCores": 1,
        "executorCores": 2,
        "executorMemory": "4G",
        "numExecutors": 3,
        "queue": "default",
        "conf": {
            "kind": "pyspark"
        }
    }'
  2. Receive HTTP response header 'Location' (see --------> Location: /batches/240 <----------)
    HTTP/1.1 201 Created
    Server: nginx
    Date: Thu, 12 Aug 2021 15:43:04 GMT
    Content-Type: application/json;charset=utf-8
    Content-Length: 198
    Connection: keep-alive
    Set-Cookie: KNOXSESSIONID=nodexxxxxxxxxx2o61kcgrsnn6s4fa2.node0; Path=/pse-de/cdp-proxy-api; Secure; HttpOnly
    Expires: Thu, 01 Jan 1970 00:00:00 GMT
    Set-Cookie: rememberMe=deleteMe; Path=/pse-de/cdp-proxy-api; Max-Age=0; Expires=Wed, 11-Aug-2021 15:43:04 GMT; SameSite=lax
    X-Content-Type-Options: nosniff
    X-Frame-Options: SAMEORIGIN
    X-XSS-Protection: 1; mode=block
                        -------->  Location: /batches/240 <----------
    
    {
        "id": 240,
        "name": null,
        "owner": "chris",
        "state": "starting",
        "appId": null,
        "appInfo": {
            "driverLogUrl": null,
            "sparkUiUrl": null,
            "executorLogUrls": null
        },
        "log": [
            "stdout: ",
            "\nstderr: ",
            "\nYARN Diagnostics: "
        ]
    }
  3. Check Job Status using Location in GET .../batches/240
    curl --location --request GET 'https://pse-de-master0.pse-7-en.ylcu-atmi.cloudera.site/pse-de/cdp-proxy-api/livy/batches/240' \
    --header 'Authorization: Basic Y2hyaX...MTAyOQ==' \
    --header 'Cookie: KNOXSESSIONID=node019j8ai8hwv6dp13rbzd9sngkbp3.node0'
  4. See the status in the response body: "state": "starting"
    {
        "id": 241,
        "name": null,
        "owner": "chris",
        "state": "starting",
        "appId": null,
        "appInfo": {
            "driverLogUrl": null,
            "sparkUiUrl": null,
            "executorLogUrls": null
        },
        "log": [
            "21/08/12 15:57:55 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).",
    ...
            "\nYARN Diagnostics: "
        ]
    }
    Repeat Step 3 as necessary until the job reaches a terminal state:
    GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
    {
        "id": 240,
        "name": null,
        "owner": "chris",
        "state": "starting",
        "appId": null,
    ...
    }
    GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
    {
        "id": 240,
        "name": null,
        "owner": "chris",
        "state": "running",
        "appId": "application_1628516987381_0021",
        "appInfo": {...
    }
    GET 'https://pse-d...site/pse-de/cdp-proxy-api/livy/batches/240' ...
    {
        "id": 240,
        "name": null,
        "owner": "chris",
        "state": "success",
        "appId": "application_1628516987381_0021",
        "appInfo": {...
    }
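
Two more Livy calls are useful once a batch exists: GET /batches/{id}/log returns the driver log lines (handy when a job ends in the dead state), and DELETE /batches/{id} cancels the batch. A sketch against the same placeholder endpoint and credentials:

    # Fetch driver log lines for batch 240
    curl --location --request GET 'https://<datahub-host>/<cluster-name>/cdp-proxy-api/livy/batches/240/log' \
    --header 'Authorization: Basic <base64 user:password>'

    # Cancel (kill) batch 240
    curl --location --request DELETE 'https://<datahub-host>/<cluster-name>/cdp-proxy-api/livy/batches/240' \
    --header 'Authorization: Basic <base64 user:password>'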

 

 

Check out Tutorial: Using CLI-API to Automate Access to Cloudera Data Engineering if you are interested in learning how to submit Spark jobs to Cloudera Data Engineering (CDE) on CDP using the command-line interface (cde CLI) and RESTful APIs.

Apache Livy - REST API 

Apache Knox 

Apache Spark - Submitting Applications 
