This is an extension of "Starting Spark jobs directly via YARN REST API".
Assume the cluster is kerberized and the only access is through Knox. Further assume that Knox uses Basic Authentication and that we have the username and password of the user who will start the Spark job.
The overall idea is the following call chain:
curl with <<user>>:<<password>> as Basic Authentication ==> Knox (verifying user:password against LDAP or AD) ==> Resource Manager (YARN REST API) with Kerberos principal
The user who should run the Spark job also needs a Kerberos principal. For this principal, create a keytab on one machine:
[root@KDC-HOST ~]$ kadmin
kadmin> xst -k /etc/security/keytabs/<<primaryName>>.keytab <<primaryName>>@<<REALM>>
Use your REALM and an appropriate primaryName for your principal.
Then distribute this keytab to all other machines in the cluster, copy it to /etc/security/keytabs, and set the ownership and permissions:
[root@CLUSTER-HOST ~]$ chown <<user>>:hadoop /etc/security/keytabs/<<primaryName>>.keytab
[root@CLUSTER-HOST ~]$ chmod 400 /etc/security/keytabs/<<primaryName>>.keytab
Test the keytab on each machine:
[root@CLUSTER-HOST ~]$ kinit <<primaryName>>@<<REALM>> \
    -k -t /etc/security/keytabs/<<primaryName>>.keytab
# There must be no password prompt!
[root@KDC-HOST ~]$ klist -l
# Principal name              Cache name
# --------------              ----------
# <<primaryName>>@<<REALM>>   FILE:/tmp/krb5cc_<<####>>
a) HDFS:
[MacBook simple-project]$ curl -s -k -u '<<user>>:<<password>>' \
    https://$KNOX_SERVER:8443/gateway/default/webhdfs/v1/?op=GETFILESTATUS
# {
#   "FileStatus": {
#     "accessTime": 0,
#     "blockSize": 0,
#     "childrenNum": 9,
#     "fileId": 16385,
#     "group": "hdfs",
#     "length": 0,
#     "modificationTime": 1458070072105,
#     "owner": "hdfs",
#     "pathSuffix": "",
#     "permission": "755",
#     "replication": 0,
#     "storagePolicy": 0,
#     "type": "DIRECTORY"
#   }
# }
b) YARN:
[MacBook simple-project]$ curl -s -k -u '<<user>>:<<password>>' -d '' \
    https://$KNOX_SERVER:8443/gateway/default/resourcemanager/v1/cluster/apps/new-application
# {
#   "application-id": "application_1460654399208_0004",
#   "maximum-resource-capability": {
#     "memory": 8192,
#     "vCores": 3
#   }
# }
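The two connection checks above can also be scripted. The following is a minimal Python sketch, not the author's script: it only builds the two requests with a Basic Authentication header and does not send them. The helper name `knox_request`, the host `knox.example.com`, and the credentials are placeholders chosen for illustration. It assumes the same `default` topology and port 8443 as the curl calls.

```python
import base64
import urllib.request


def knox_request(knox_server, path, user, password, data=None):
    """Build (not send) a Basic-Auth request against the Knox 'default' topology."""
    url = f"https://{knox_server}:8443/gateway/default/{path}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=data)
    request.add_header("Authorization", f"Basic {token}")
    return request


# a) WebHDFS: GET the status of the HDFS root directory
hdfs_check = knox_request(
    "knox.example.com", "webhdfs/v1/?op=GETFILESTATUS", "user", "secret"
)

# b) YARN: an empty body turns the request into a POST, like curl -d ''
yarn_check = knox_request(
    "knox.example.com",
    "resourcemanager/v1/cluster/apps/new-application",
    "user",
    "secret",
    data=b"",
)

print(hdfs_check.get_method(), hdfs_check.full_url)
print(yarn_check.get_method(), yarn_check.full_url)
```

To actually send a request, pass it to `urllib.request.urlopen`. Note that `curl -k` skips certificate verification; the Python equivalent would need an explicitly relaxed SSL context, which is only advisable on test clusters.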
The following values need to be changed or added compared to "Starting Spark jobs directly via YARN REST API":
spark.history.kerberos.keytab=/etc/security/keytabs/spark.headless.keytabs
spark.history.kerberos.principal=spark-Demo@<<REALM>>
spark.yarn.keytab=/etc/security/keytabs/<<primaryName>>.keytab
spark.yarn.principal=<<primaryName>>@<<REALM>>
The following properties need to be added to the command attribute (before org.apache.spark.deploy.yarn.ApplicationMaster) compared to "Starting Spark jobs directly via YARN REST API":
-Dspark.yarn.keytab=/etc/security/keytabs/<<primaryName>>.keytab \
-Dspark.yarn.principal=<<primaryName>>@<<REALM>> \
-Dspark.yarn.credentials.file=hdfs://<<name-node>>:8020/tmp/simple-project/credentials_4b023f93-fbde-48ff-b2c8-516251aeed52 \
-Dspark.history.kerberos.keytab=/etc/security/keytabs/spark.headless.keytabs \
-Dspark.history.kerberos.principal=spark-Demo@<<REALM>> \
-Dspark.history.kerberos.enabled=true
credentials_4b023f93-fbde-48ff-b2c8-516251aeed52 is just a unique filename; the file does not need to exist. To build it, concatenate "credentials_" with a UUID4. Setting this property is the trigger for Spark to start a Delegation Token refresh thread.
See the attachment for details.
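As an illustration of the naming rule and of how the options fit together, here is a small Python sketch. It is an assumption-laden example rather than the author's script: the function names, the host `namenode.example.com`, the principal `alice`, and the realm `EXAMPLE.COM` are made up, while the property names, paths, and port 8020 are copied from the listing above.

```python
import uuid


def credentials_file_uri(name_node, project_dir="/tmp/simple-project"):
    """Return a unique HDFS URI; the file itself does not have to exist."""
    return f"hdfs://{name_node}:8020{project_dir}/credentials_{uuid.uuid4()}"


def kerberos_java_opts(primary_name, realm, credentials_uri):
    """Assemble the -D options from the listing as one space-separated string."""
    opts = {
        "spark.yarn.keytab": f"/etc/security/keytabs/{primary_name}.keytab",
        "spark.yarn.principal": f"{primary_name}@{realm}",
        "spark.yarn.credentials.file": credentials_uri,
        "spark.history.kerberos.keytab": "/etc/security/keytabs/spark.headless.keytabs",
        "spark.history.kerberos.principal": f"spark-Demo@{realm}",
        "spark.history.kerberos.enabled": "true",
    }
    return " ".join(f"-D{key}={value}" for key, value in opts.items())


# Hypothetical values, for illustration only
uri = credentials_file_uri("namenode.example.com")
java_opts = kerberos_java_opts("alice", "EXAMPLE.COM", uri)
print(java_opts)
```

The resulting string would be placed in the command attribute before org.apache.spark.deploy.yarn.ApplicationMaster. A fresh UUID per submission keeps the credentials filename unique across jobs.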
Same as in "Starting Spark jobs directly via YARN REST API"; however, one needs to provide -u <<user>>:<<password>> to the curl command to authenticate with Knox.
Once Knox has authenticated the request, YARN and Spark take the keytabs for the subsequent steps from the properties file and the job JSON file.
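For readers who prefer Python over curl, the authenticated submission could look roughly like the sketch below. It rests on several assumptions: the submit path `resourcemanager/v1/cluster/apps` is inferred from the new-application URL used earlier, the function name and host are hypothetical, and the `job` dictionary is only a stand-in for the real job JSON file described in the referenced article. The request is built but not sent.

```python
import base64
import json
import urllib.request


def build_submit_request(knox_server, user, password, job):
    """Build (not send) the POST that submits the job JSON through Knox."""
    # Assumed path, derived from the new-application URL shown earlier
    url = f"https://{knox_server}:8443/gateway/default/resourcemanager/v1/cluster/apps"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps(job).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    # Equivalent of curl's -u <<user>>:<<password>>
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/json")
    return request


# Stand-in payload; the real job JSON contains many more attributes
job = {"application-id": "application_1460654399208_0004"}
submit = build_submit_request("knox.example.com", "user", "secret", job)
print(submit.get_method(), submit.full_url)
```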
After the job has finished successfully, the log aggregation status will remain "RUNNING" until it reaches "TIME_OUT".
Again, more details and a Python script to ease the whole process can be found in the Spark-Yarn-REST-API repo.
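Because of this, a monitoring script should probably decide completion from the application state rather than from the log aggregation status. The sketch below shows one way to do that on an already parsed application report. The `state` and `finalStatus` fields follow YARN's Cluster Application API; the `logAggregationStatus` field name and the sample report are assumptions for illustration, not captured output.

```python
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def job_is_done(app_report):
    """Decide completion from the application state, ignoring log aggregation."""
    return app_report["app"]["state"] in TERMINAL_STATES


def job_succeeded(app_report):
    """True only if the application finished and reported success."""
    app = app_report["app"]
    return app["state"] == "FINISHED" and app["finalStatus"] == "SUCCEEDED"


# Illustrative report: the job is finished although log aggregation still runs
sample = {
    "app": {
        "state": "FINISHED",
        "finalStatus": "SUCCEEDED",
        "logAggregationStatus": "RUNNING",  # assumed field name
    }
}

print(job_is_done(sample), job_succeeded(sample))
```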
Any comment to make this process easier is highly appreciated ...