Assume the cluster is Kerberized and the only access path is Knox. Further assume that Knox uses Basic Authentication and that we have the username and password of the user who should start the Spark job.
The overall idea is to call
curl with <<user>>:<<password>> as Basic Authentication
==> Knox (which verifies user:password against LDAP or AD)
==> Resource Manager (YARN REST API), acting with a Kerberos principal
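The chain above can be sketched with a single curl call. The gateway host, port, and topology name (`default`) are assumptions; adjust them to your Knox setup:

```shell
# Hypothetical Knox gateway address and topology -- adjust to your cluster.
KNOX_URL="https://knox-host:8443/gateway/default"

# Ask the YARN Resource Manager (proxied through Knox) for a new application ID.
# -u sends <<user>>:<<password>> as Basic Authentication; Knox checks it
# against LDAP/AD and forwards the request to the RM with its Kerberos identity.
curl -sk -u "<<user>>:<<password>>" \
  -X POST "${KNOX_URL}/resourcemanager/v1/cluster/apps/new-application"
```

The response contains an `application-id` that is needed for the actual job submission later on.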
1) Create and distribute a keytab
The user that should run the Spark job also needs a Kerberos principal. Create a keytab for this principal and check it on one machine:
[root@CLUSTER-HOST ~]$ kinit <<primaryName>>@<<REALM>> \
    -k -t /etc/security/keytabs/<<primaryName>>.keytab
# There must be no password prompt!
[root@CLUSTER-HOST ~]$ klist -l
# Principal name              Cache name
# --------------              ----------
# <<primaryName>>@<<REALM>>   FILE:/tmp/krb5cc_<<####>>
2) Test the connection from a workstation outside the cluster
Compared to starting Spark jobs directly via the YARN REST API, the following properties need to be added to the command attribute (before org.apache.spark.deploy.yarn.ApplicationMaster):
credentials_4b023f93-fbde-48ff-b2c8-516251aeed52 is just a unique filename; the file does not need to exist. Concatenate "credentials_" with a UUID4. This property is the trigger for Spark to start a delegation token refresh thread.