spark-submit --proxy-user error

Super Guru

Hi, I am running a job against a secured Spark cluster. I have a valid keytab, and proxy-user settings for this user are defined in core-site.xml.
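The proxy-user entries follow the standard Hadoop pattern, roughly like this (placeholders throughout, with <superuser> standing in for my keytab principal's short name):

<!-- hypothetical entries: <superuser> is a placeholder; the "*" values
     would be narrowed to real hosts/groups on an actual cluster -->
<property>
  <name>hadoop.proxyuser.<superuser>.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.<superuser>.groups</name>
  <value>*</value>
</property>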

When I run the job (command sketched below), I get the following error. Any idea?
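For reference, the submit command is roughly of this shape (names, paths, and application details are placeholders for my real values):

spark-submit --master yarn --deploy-mode cluster \
  --principal <principal>@<REALM> \
  --keytab <path to keytab file> \
  --proxy-user <proxyuser> \
  --class <main class> <application jar>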

16/08/01 01:06:55 INFO yarn.Client: Attempting to login to the Kerberos using principal: <principal@REALM.COM> and keytab: <path to keytab file>

16/08/01 01:06:55 INFO client.RMProxy: Connecting to ResourceManager at <host>/<IP>:8032

16/08/01 01:06:55 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers

16/08/01 01:06:55 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (3072 MB per container)

16/08/01 01:06:55 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead

16/08/01 01:06:55 INFO yarn.Client: Setting up container launch context for our AM
16/08/01 01:06:55 INFO yarn.Client: Setting up the launch environment for our AM container
16/08/01 01:06:56 INFO yarn.Client: Credentials file set to: credentials-227f50ae-ab28-4b37-823d-15b3d723185a
16/08/01 01:06:56 INFO yarn.YarnSparkHadoopUtil: getting token for namenode: hdfs://<host>:8020/user/<proxyuser>/.sparkStaging/application_1469977170124_0004
16/08/01 01:06:56 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 119 for <proxyuser> on 10.0.0.10:8020
16/08/01 01:06:56 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: <proxyuser> tries to renew a token with renewer <kerberos principal>
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:484)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7503)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:549)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:673)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:984)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
1 ACCEPTED SOLUTION

Super Guru
@Vipin Rathor

I figured this one out. Here is the thing: Spark doesn't allow submitting a keytab and principal together with --proxy-user. The reason is that the keytab and principal are needed when you are running long-running jobs. In that case, the keytab is copied to the application master's staging area, and that keytab and principal are used to renew the delegation tokens required for HDFS. This enables the application to keep working without any security issue. Remember, this feature is explicitly for long-running applications. See details here under YARN mode, second paragraph.
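In other words, the keytab and principal options are meant for a submission like this (a sketch of the long-running case, with placeholder names and no --proxy-user):

spark-submit --master yarn --deploy-mode cluster \
  --principal <principal>@<REALM> \
  --keytab <path to keytab file> \
  --class <main class> <application jar>

Here Spark itself ships the keytab to the AM staging area so tokens can be renewed for as long as the application runs.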

Now imagine if every application user logging into my application could proxy to my keytab user: the YARN application would be submitted as the proxy user, but the keytab would still be copied along with it, which means those users could read the contents of that keytab. That is a huge security flaw, and that's why what I was trying above is not allowed.

So I have to do what Hive does to run spark-submit: kinit before submitting the application, and then provide a proxy user. Here is how I solved it:

kinit -k -t <mykeytab file> <principal>@<REALM>

spark-submit <all my options, including --proxy-user <my proxy user>>

This way my proxy user cannot read the keytab contents and is used only as a proxy user. My application is not long running (i.e., longer than the 7 days that is the usual maximum renewable lifetime of a TGT), so I am fine.
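As a quick sanity check after the kinit, klist shows the ticket's expiry and renew-until times (exact output varies by Kerberos distribution):

# verify the TGT obtained above, including its expiry and renew-until times
klist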


3 REPLIES

Guru

@mqureshi I'd like to see the values you are using: what are <principal>, <keytab>, and <proxyuser> set to? Similarly, what value are you seeing for <kerberos principal> in the log?



@mqureshi I have a similar problem, but in my case I don't want to create separate tickets for application users. My requirement is that all services in Hadoop should be accessed via Knox as the proxy user; Knox will have taken care of authentication separately. So in my case all authenticated application users, e.g. user1, user2, etc., should be able to run jobs with the knox proxy user. This link talks about exactly the same concept: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/Superusers.html#Use_Case
Here the idea is not to have separate Kerberos credentials for each individual application user.

Any thoughts from your side on what would be required for this?