Spark HBase issue on secured cluster using newAPIHadoopRDD

Contributor

Hi,

Does anyone know how to get Spark talking to HBase on a secured cluster?

I have a Kerberized Hadoop cluster (HDP 2.5) and want to scan HBase tables from Spark using newAPIHadoopRDD.

A Spark application in local mode can easily authenticate to AD using a keytab and communicate with HBase.

When I run it on YARN, the driver can authenticate with AD and get the TGT in two ways:

  1. using --keytab and --principal on spark-submit
  2. with the help of UserGroupInformation.loginUserFromKeytabAndReturnUGI

But the executors fail and can't get a Kerberos TGT, even though the keytab is available to Spark on all the nodes.

The problem is that the executor side is handled by newAPIHadoopRDD, and I can't find a way to make it use my user and its headless keytab.
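
For reference, here is a minimal sketch of option 2 above (the principal, keytab path, and table name are placeholders, not my real values):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

object SecureHBaseScan {
  def main(args: Array[String]): Unit = {
    // Driver-side Kerberos login from a headless keytab; this part works
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "myuser@EXAMPLE.COM", "/etc/security/keytabs/myuser.keytab")

    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hbase-scan"))
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

        // The tasks spawned for this RDD are where the GSS failure shows up
        val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])
        println(rdd.count())
        sc.stop()
      }
    })
  }
}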

Then I get the following well-known exception on all the executors:

javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
        at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:611)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:156)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:737)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:734)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:734)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:887)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:856)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32741)
        at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
        at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:338)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
        at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
        at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
        at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
        at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
        at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
        at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
        at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)

6 REPLIES

Super Guru

You likely do not want to distribute your keytab to all nodes: this is the same principle as lending out the keys to your house and trusting that no one will use them maliciously. If the keytab is ever compromised, you have to change all of your locks.

HBase supports delegation token authentication, which lets you acquire a short-lived "password" using your Kerberos credentials before submitting your task. This short-lived password can be passed around with greater confidence because, if compromised, it can be revoked without affecting your 'master' keytab.

In the pure-MapReduce context, you can do this by invoking

TableMapReduceUtil.initCredentials(jobConf);

My understanding is that this works similarly for Spark, but I don't have a lot of experience here, so the snippet below is only a sketch. Hope it helps.
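
Untested, and assuming an existing SparkContext named sc and a placeholder table name, the idea would be to fetch the token in the driver, where the TGT lives, before newAPIHadoopRDD spawns any tasks:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.security.UserGroupInformation

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder

// Runs in the driver, which holds the Kerberos TGT: fetches an HBase
// delegation token and stores it in the job's credentials
val job = Job.getInstance(hbaseConf)
TableMapReduceUtil.initCredentials(job)

// Merge the token into the current user's credentials so Spark ships it to
// the executors (whether this step is required may depend on the Spark version)
UserGroupInformation.getCurrentUser.addCredentials(job.getCredentials)

val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])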

Contributor

Thanks for the answer, @Josh Elser.

My problem is not whether this is the best practice for handling authentication.

First I want my code to work! I'm trying to find a way to make newAPIHadoopRDD authenticate to AD on the executors, but no success so far.

Super Collaborator (accepted solution)

Please see the steps outlined here for accessing HBase securely via Spark. No code change should be needed in your app for typical use cases.

Contributor

Thanks, @bikas.

SHC has its own problems, and I haven't found it usable yet. Please let me know if you've tried it.

Here are the problems I faced:

1- The SHC tech preview is released with HDP 2.5, but the Spark versions differ slightly: HDP 2.5 ships Spark 1.6.2, while SHC is built for 1.6.1. I actually tried to upgrade Spark and rebuild SHC, but it failed in a few test cases.

2- Spark 1.6.2 is built against Scala 2.10, as you can see at http://repo.hortonworks.com/content/repositories/releases/org/apache/spark/spark-core_2.10

but SHC for Spark 1.6.1 is built against Scala 2.11, which seems odd:

http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/shc-core/1.0.1-1.6-s_2.10/

The link above suggests a Scala 2.10 build, but if you try it you will notice it was actually built and released against 2.11.

Super Collaborator

Sorry about the bad builds. We are working through the automation process that builds different versions of SHC.

My comment was mainly about the configuration section in the SHC README for secure clusters. That part is independent of SHC itself; it's just instructions on how to set up Spark to access HBase tokens, roughly like the command below.
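
As a rough shape (paths, principal, and class/jar names here are placeholders; the README has the exact settings, and your HBase client jars still need to be on the classpath):

spark-submit --master yarn-cluster \
  --principal myuser@EXAMPLE.COM \
  --keytab /etc/security/keytabs/myuser.keytab \
  --files /etc/hbase/conf/hbase-site.xml \
  --class com.example.SecureHBaseScan \
  myapp.jar

With --principal and --keytab, Spark itself handles the Kerberos login and token renewal, and shipping hbase-site.xml tells it how to reach the secured HBase.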

New Contributor

Hi @Mahan Hosseinzadeh,

Did you find any solution for this issue? I am facing the same issue and have been stuck for quite some time without any success.