
This article extends "Starting Spark jobs directly via YARN REST API".

Assume the cluster is kerberized and can only be reached through Knox. Further assume that Knox uses Basic Authentication and that we know the user name and password of the user who will start the Spark job.

The overall idea is the following call chain:

curl with <<user>>:<<password>> as Basic Authentication ==> Knox (verifies user:password against LDAP or AD) ==> Resource Manager (YARN REST API) with Kerberos principal
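
As a minimal sketch of the first hop in this chain, the following Python snippet builds the same Basic Authentication header that curl sends with -u. The host name knox.example.com and the credentials alice/secret are placeholders for illustration only, and the helper name knox_request is hypothetical. The request is only constructed here, not sent.

```python
import base64
import urllib.request


def knox_request(url, user, password, data=None):
    """Build a request that carries HTTP Basic Authentication for Knox."""
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(url, data=data)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder gateway host and credentials (assumptions, not real values).
req = knox_request(
    "https://knox.example.com:8443/gateway/default/webhdfs/v1/?op=GETFILESTATUS",
    "alice",
    "secret",
)
print(req.get_method(), req.full_url)
```

Knox checks these credentials against LDAP or AD; the Kerberos part of the chain happens behind the gateway and is not visible to the client.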

1) Create and distribute a keytab

The user who should run the Spark job also needs a Kerberos principal. Create a keytab for this principal on one machine:

[root@KDC-HOST ~]$ kadmin
kadmin> xst -k /etc/security/keytabs/<<primaryName>>.keytab <<primaryName>>@<<REALM>>

Use your REALM and an appropriate primaryName for your principal.

Then distribute this keytab to all other machines in the cluster, copy it to /etc/security/keytabs, and set its ownership and permissions:

[root@CLUSTER-HOST ~]$ chown <<user>>:hadoop /etc/security/keytabs/<<primaryName>>.keytab
[root@CLUSTER-HOST ~]$ chmod 400 /etc/security/keytabs/<<primaryName>>.keytab

Test the keytab on each machine

[root@CLUSTER-HOST ~]$ kinit <<primaryName>>@<<REALM>> \
                       -k -t /etc/security/keytabs/<<primaryName>>.keytab
# There must be no password prompt!

[root@CLUSTER-HOST ~]$ klist -l
# Principal name                 Cache name
# --------------                 ----------
# <<primaryName>>@<<REALM>>      FILE:/tmp/krb5cc_<<####>>
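
Because the keytab has to be tested on every machine, the check can be scripted. The sketch below assumes kinit and klist are on the PATH and that klist -l prints the two-column layout shown above. All function names are hypothetical.

```python
import subprocess


def kinit_command(principal, keytab):
    """Return the kinit invocation for a keytab login as an argument list."""
    return ["kinit", "-k", "-t", keytab, principal]


def principal_in_cache_list(klist_output, principal):
    """Return True if the `klist -l` output lists a cache for the principal."""
    for line in klist_output.splitlines():
        fields = line.split()
        if fields and fields[0] == principal:
            return True
    return False


def keytab_works(principal, keytab):
    """Run kinit with the keytab, then look for the principal in `klist -l`."""
    # stdin is detached; a working keytab makes a password prompt unnecessary.
    subprocess.run(
        kinit_command(principal, keytab),
        check=True,
        stdin=subprocess.DEVNULL,
    )
    listing = subprocess.run(
        ["klist", "-l"], check=True, capture_output=True, text=True
    )
    return principal_in_cache_list(listing.stdout, principal)
```

On a cluster host, keytab_works("<<primaryName>>@<<REALM>>", "/etc/security/keytabs/<<primaryName>>.keytab") raises subprocess.CalledProcessError if kinit fails and otherwise reports whether the ticket cache is listed.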

2) Test connection from the workstation outside the cluster

a) HDFS:

[MacBook simple-project]$ curl -s -k -u '<<user>>:<<password>>' \
                          https://$KNOX_SERVER:8443/gateway/default/webhdfs/v1/?op=GETFILESTATUS
# {
#   "FileStatus": {
#     "accessTime": 0,
#     "blockSize": 0,
#     "childrenNum": 9,
#     "fileId": 16385,
#     "group": "hdfs",
#     "length": 0,
#     "modificationTime": 1458070072105,
#     "owner": "hdfs",
#     "pathSuffix": "",
#     "permission": "755",
#     "replication": 0,
#     "storagePolicy": 0,
#     "type": "DIRECTORY"
#   }
# }

b) YARN:

[MacBook simple-project]$ curl -s -k -u '<<user>>:<<password>>' -d '' \
                          https://$KNOX_SERVER:8443/gateway/default/resourcemanager/v1/cluster/apps/new-application
# {
#   "application-id": "application_1460654399208_0004",
#   "maximum-resource-capability": {
#     "memory": 8192,
#     "vCores": 3
#   }
# }
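
The two connection tests can also be run from Python. This is a sketch under a few assumptions: the gateway listens on port 8443 with the "default" topology as in the curl calls above, certificate verification is switched off to mirror curl -k, and the helper names are hypothetical.

```python
import base64
import json
import ssl
import urllib.request


def gateway_url(knox_server, service_path, topology="default", port=8443):
    """Compose a Knox gateway URL for a service path such as webhdfs/v1/."""
    return f"https://{knox_server}:{port}/gateway/{topology}/{service_path}"


def insecure_context():
    """TLS context that skips certificate checks, like curl -k."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    return context


def call_knox(url, user, password, data=None):
    """Send a request (POST when data is not None) and decode the JSON reply."""
    raw = f"{user}:{password}".encode("utf-8")
    headers = {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request, context=insecure_context()) as response:
        return json.loads(response.read().decode("utf-8"))


def hdfs_reachable(knox_server, user, password):
    """Test a): the HDFS root should be reported as a directory."""
    url = gateway_url(knox_server, "webhdfs/v1/?op=GETFILESTATUS")
    return call_knox(url, user, password)["FileStatus"]["type"] == "DIRECTORY"


def new_application_id(knox_server, user, password):
    """Test b): ask YARN for a new application id with an empty POST body."""
    url = gateway_url(knox_server, "resourcemanager/v1/cluster/apps/new-application")
    return call_knox(url, user, password, data=b"")["application-id"]
```

Skipping certificate verification is acceptable for a quick test only; with a trusted gateway certificate, ssl.create_default_context() can be used unchanged.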

3) Changes to spark-yarn.properties

The following values need to be changed or added compared to Starting Spark jobs directly via YARN REST API:

spark.history.kerberos.keytab=/etc/security/keytabs/spark.headless.keytabs
spark.history.kerberos.principal=spark-Demo@<<REALM>>
spark.yarn.keytab=/etc/security/keytabs/<<primaryName>>.keytab
spark.yarn.principal=<<primaryName>>@<<REALM>>
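
Since these four lines differ only in the principal and realm, they can be generated. The sketch below is one way to do that; the function names are hypothetical, and the history server principal and keytab are passed in because they depend on the cluster.

```python
def kerberos_properties(primary_name, realm, history_principal, history_keytab):
    """Return the Kerberos-related entries for spark-yarn.properties."""
    return {
        "spark.history.kerberos.keytab": history_keytab,
        "spark.history.kerberos.principal": f"{history_principal}@{realm}",
        "spark.yarn.keytab": f"/etc/security/keytabs/{primary_name}.keytab",
        "spark.yarn.principal": f"{primary_name}@{realm}",
    }


def render_properties(properties):
    """Render a dict as key=value lines, one property per line."""
    return "".join(f"{key}={value}\n" for key, value in properties.items())


# Placeholder values for illustration only.
print(
    render_properties(
        kerberos_properties(
            "alice",
            "EXAMPLE.COM",
            "spark-Demo",
            "/etc/security/keytabs/spark.headless.keytabs",
        )
    ),
    end="",
)
```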

4) Changes to spark-yarn.json

The following properties need to be added to the command attribute (before org.apache.spark.deploy.yarn.ApplicationMaster), compared to Starting Spark jobs directly via YARN REST API:

-Dspark.yarn.keytab=/etc/security/keytabs/<<primaryName>>.keytab \
-Dspark.yarn.principal=<<primaryName>>@<<REALM>> \
-Dspark.yarn.credentials.file=hdfs://<<name-node>>:8020/tmp/simple-project/credentials_4b023f93-fbde-48ff-b2c8-516251aeed52 \
-Dspark.history.kerberos.keytab=/etc/security/keytabs/spark.headless.keytabs \
-Dspark.history.kerberos.principal=spark-Demo@<<REALM>> \
-Dspark.history.kerberos.enabled=true

credentials_4b023f93-fbde-48ff-b2c8-516251aeed52 is just a unique filename; the file does not need to exist. Build the name by concatenating "credentials_" with a UUID4. Setting this property is the trigger for Spark to start a delegation token refresh thread.

See the attachment for details.
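
A minimal sketch of how the unique credentials file name and the additional -D options could be generated. The function names are hypothetical, and the default directory /tmp/simple-project and port 8020 are simply taken from the example above.

```python
import uuid


def credentials_file(name_node, app_dir="/tmp/simple-project", port=8020):
    """Return an HDFS URI ending in "credentials_" plus a fresh UUID4."""
    return f"hdfs://{name_node}:{port}{app_dir}/credentials_{uuid.uuid4()}"


def kerberos_java_options(primary_name, realm, credentials_uri,
                          history_principal, history_keytab):
    """Return the -D options to place before the ApplicationMaster class."""
    options = {
        "spark.yarn.keytab": f"/etc/security/keytabs/{primary_name}.keytab",
        "spark.yarn.principal": f"{primary_name}@{realm}",
        "spark.yarn.credentials.file": credentials_uri,
        "spark.history.kerberos.keytab": history_keytab,
        "spark.history.kerberos.principal": f"{history_principal}@{realm}",
        "spark.history.kerberos.enabled": "true",
    }
    return " ".join(f"-D{key}={value}" for key, value in options.items())


# Placeholder host and principal names for illustration only.
uri = credentials_file("namenode.example.com")
print(
    kerberos_java_options(
        "alice", "EXAMPLE.COM", uri,
        "spark-Demo", "/etc/security/keytabs/spark.headless.keytabs",
    )
)
```

Generating a new UUID per submission keeps the file names of different jobs from colliding.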

5) Submit a Job

Same as in Starting Spark jobs directly via YARN REST API; however, one needs to pass -u <<user>>:<<password>> to the curl command to authenticate with Knox.

After Knox has authenticated the user, YARN and Spark take the keytabs for the subsequent steps from the properties file and the job JSON file.
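
For reference, the submission can be sketched in Python as well. The YARN REST API accepts a new application as a JSON POST to cluster/apps; the Knox path used here is an assumption derived from the new-application URL in step 2, and the job dictionary is a stub that stands in for the full content of spark-yarn.json.

```python
import base64
import json
import urllib.request


def submit_request(knox_server, user, password, job):
    """Build the authenticated POST that submits a job description to YARN."""
    # Assumed Knox path: same prefix as the new-application call in step 2.
    url = f"https://{knox_server}:8443/gateway/default/resourcemanager/v1/cluster/apps"
    raw = f"{user}:{password}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(raw).decode("ascii"),
        "Content-Type": "application/json",
    }
    body = json.dumps(job).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Stub job description; a real spark-yarn.json carries many more fields.
job = {"application-id": "application_1460654399208_0004"}
request = submit_request("knox.example.com", "alice", "secret", job)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request). Certificates are verified
# by default, unlike curl -k.
```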

6) Known Issue

After the job finishes successfully, the log aggregation status remains "RUNNING" until it eventually changes to "TIME_OUT".
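
A script that waits for the job should therefore not treat the log aggregation status as a completion signal. The sketch below judges completion from the state and finalStatus fields of the application report instead. It assumes the report has already been fetched and parsed from the YARN REST API, and that the report exposes the aggregation status as logAggregationStatus, which may depend on the Hadoop version.

```python
def job_succeeded(app_report):
    """Return True if YARN reports the application as finished successfully."""
    app = app_report["app"]
    return app.get("state") == "FINISHED" and app.get("finalStatus") == "SUCCEEDED"


def log_aggregation_pending(app_report):
    """Return True while log aggregation is still reported as RUNNING."""
    # Assumption: the field name logAggregationStatus; absent means not pending.
    return app_report["app"].get("logAggregationStatus") == "RUNNING"


# Hand-written sample report for illustration, not real cluster output.
report = {
    "app": {
        "id": "application_1460654399208_0004",
        "state": "FINISHED",
        "finalStatus": "SUCCEEDED",
        "logAggregationStatus": "RUNNING",
    }
}
print(job_succeeded(report), log_aggregation_pending(report))
```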

7) More details

Again, more details and a Python script that eases the whole process can be found in the Spark-Yarn-REST-API repo.

Any comment to make this process easier is highly appreciated ...
