Does anyone know how to let Spark talk to HBase on a secured cluster?
I have a Kerberized Hadoop cluster (HDP 2.5) and want to scan HBase tables from Spark using newAPIHadoopRDD.
A Spark application in local mode can easily authenticate to AD using a keytab and communicate with HBase.
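For reference, the working local-mode setup looks roughly like this (an untested sketch; the principal, keytab path, and table name are placeholders, and the HBase 1.1 client API shipped with HDP 2.5 is assumed):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

object LocalHBaseScan {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    UserGroupInformation.setConfiguration(conf)
    // Explicit login from the headless keytab (placeholder principal/path).
    UserGroupInformation.loginUserFromKeytab(
      "myuser@EXAMPLE.COM", "/etc/security/keytabs/myuser.keytab")

    conf.set(TableInputFormat.INPUT_TABLE, "my_table")

    val sc = new SparkContext(
      new SparkConf().setAppName("hbase-scan").setMaster("local[*]"))
    // In local mode the driver and the "executors" share one JVM,
    // so the login above covers the scan.
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(rdd.count())
    sc.stop()
  }
}
```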
When I run it on YARN, the driver can authenticate with AD and obtain a TGT (in two ways), but the executors fail to get a Kerberos TGT even though the keytab is available to Spark on all the nodes. The problem is that the executors are managed by newAPIHadoopRDD, and I can't find a way to make them use my user and its headless keytab.
I then get the following well-known exception on all the executors:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
	at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:611)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:156)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:737)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:734)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:734)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:887)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:856)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32741)
	at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
	at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:338)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
	at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
	at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
	at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
	at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
	at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
You likely do not want to distribute your keytab to all nodes, as that follows the same principle as lending out the keys to your house and trusting that no one will use them maliciously. If it were ever compromised, you would have to change all of your locks.
HBase supports delegation token authentication which lets you acquire a short-lived "password" using your Kerberos credentials before submitting your task. This short-lived password can be passed around with greater confidence that, if compromised, it can be revoked without affecting your 'master' keytab.
In the pure-MapReduce context, you can do this by invoking the token-acquisition utility before submitting the job. My understanding is that this works similarly for Spark, but I don't have a lot of experience here. Hope it helps.
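For what it's worth, with the HBase 1.x client API the token acquisition in the driver looks roughly like this (an untested sketch, not verified on HDP 2.5; the principal and keytab path are placeholders):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.security.token.TokenUtil
import org.apache.hadoop.security.UserGroupInformation

val conf = HBaseConfiguration.create()
UserGroupInformation.setConfiguration(conf)
// Authenticate once in the driver with the keytab (placeholder principal/path).
UserGroupInformation.loginUserFromKeytab(
  "myuser@EXAMPLE.COM", "/etc/security/keytabs/myuser.keytab")

val conn = ConnectionFactory.createConnection(conf)
try {
  // Ask HBase for a short-lived delegation token and attach it to the current
  // user's credentials; tasks shipped with these credentials can then
  // authenticate to HBase without ever seeing the keytab.
  val token = TokenUtil.obtainToken(conn)
  UserGroupInformation.getCurrentUser.addToken(token)
} finally {
  conn.close()
}
```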
Thanks for the answer @Josh Elser
My problem is not whether this is the best practice for handling authentication.
First, I want my code to work! I'm trying to find a way to make newAPIHadoopRDD authenticate to AD on the executors, but no success yet.
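The only workaround I can think of so far is to avoid newAPIHadoopRDD on the executors entirely and open the HBase connection myself inside mapPartitions under an explicit login; an untested sketch (the principal, keytab path, and table name are placeholders):

```scala
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.rdd.RDD

def fetchRows(rowKeys: RDD[String]): RDD[(String, Int)] =
  rowKeys.mapPartitions { keys =>
    val conf = HBaseConfiguration.create()
    UserGroupInformation.setConfiguration(conf)
    // The keytab must exist at this path on every worker node.
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "myuser@EXAMPLE.COM", "/etc/security/keytabs/myuser.keytab")
    // Run the HBase I/O as the freshly logged-in user.
    ugi.doAs(new PrivilegedExceptionAction[Iterator[(String, Int)]] {
      override def run(): Iterator[(String, Int)] = {
        val conn  = ConnectionFactory.createConnection(conf)
        val table = conn.getTable(TableName.valueOf("my_table"))
        val out = keys.toList.map { k =>
          val r = table.get(new Get(Bytes.toBytes(k)))
          (k, r.size) // placeholder: extract whatever cells you need here
        }
        table.close(); conn.close()
        out.iterator
      }
    })
  }
```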
SHC has its own problems and I haven't found it useful yet; please let me know if you've tried it.
Here are the problems I faced:
1- The SHC tech preview is released with HDP 2.5, but the Spark versions are slightly different: HDP 2.5 ships Spark 1.6.2, while SHC is built for 1.6.1. I actually tried to upgrade Spark and rebuild SHC, but it failed in a few test cases.
2- Spark 1.6.2 is built against Scala 2.10, as you can see at http://repo.hortonworks.com/content/repositories/releases/org/apache/spark/spark-core_2.10, but SHC for Spark 1.6.1 is built against Scala 2.11, which sounds weird. The artifact name in the link above suggests Scala 2.10, but if you try it you will notice it was actually built and released against 2.11!
Sorry about the bad builds. We are working through the automation process that builds different versions of SHC.
My comment was mainly about the configuration section in the README for SHC for secure clusters. That is independent of SHC; it's just instructions on how to set up Spark to access HBase tokens.