Created 09-26-2018 09:38 PM
Hi,
On the first day, we started the Spark Thrift Server with a keytab file and principal; we could use beeline to connect to the database and query data from the tables:
[spark@xxx] $SPARK_HOME/sbin/start-thriftserver.sh --master yarn-client \
  --keytab /keytab/spark_thrift.keytab \
  --principal thriftuser/thrift.server.org@THRIFT.REALMS.ORG \
  --hiveconf hive.server2.thrift.port=10102 \
  --conf spark.hadoop.fs.hdfs.impl.disable.cache=true \
  --hiveconf hive.server2.authentication.kerberos.principal=thriftuser/thrift.server.org@THRIFT.REALMS.ORG \
  --hiveconf hive.server2.authentication.kerberos.keytab=/keytab/spark_thrift.keytab \
  --hiveconf hive.server2.logging.operation.enabled=true
We renewed the Kerberos ticket for the principal every 18 hours:
while true; do
  kinit -kt /keytab/spark_thrift.keytab thriftuser/thrift.server.org@THRIFT.REALMS.ORG
  sleep 18h
done &
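For completeness, this is roughly how we confirm the renewal loop is taking effect (just klist on the same account; output format may differ by Kerberos distribution):

# Check the default ticket cache of the spark user running the loop above;
# the "Expires" / "renew until" times should move forward after each kinit.
klist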
On the first day after starting Spark Thrift, beeline worked normally:
[spark@xxx] beeline
beeline> !connect jdbc:hive2://hive.server.org:10102/database;principal=thriftuser/thrift.server.org@THRIFT.REALMS.ORG
beeline> select count(1) from table;   ### This returned the expected result from the table.
But about one day later, when we tried to query data again, it threw errors like:
java.lang.ClassCastException: org.apache.hadoop.security.authentication.client.AuthenticationException cannot be cast to java.security.GeneralSecurityException
    at org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:189)
    at org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388)
    at org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1381)
    at org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:1451)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:305)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:252)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:251)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
What we've checked:
1, The Kerberos ticket for the principal was renewed every 18 hours (see the loop above).
2, Checked the Spark Thrift log; the credentials stored in HDFS were being refreshed roughly every 16 hours (a quick way to confirm the credential files keep appearing is sketched after this list):
18/09/19 16:45:58 INFO Client: Credentials file set to : credentials-xxxxx
18/09/19 16:45:59 INFO Client: To enable the AM to login from keytab, credentials are being copied over to the AM via the YARN secure Distributed Cache.
18/09/19 16:46:10 INFO CredentialUpdater: Scheduling credentials refresh from HDFS in 57588753 ms.
18/09/20 08:45:58 INFO CredentialUpdater: Reading new credentials from hdfs://cluster/user/thriftuser/.sparkStaging/application_xxx/credentials-xxyyx
18/09/20 08:45:58 INFO CredentialUpdater: Credentials updated from credentials files.
18/09/20 08:45:58 INFO CredentialUpdater: Scheduling credentials refresh from HDFS in 57588700 ms.
3, Checked the Ranger KMS access log; the decrypt request was rejected with HTTP 403 (a way to test KMS access directly as this principal is sketched after this list):
xxx.xxx.xxx.xxx - - [20/Sep/2018:10:57:50 +0800] "POST /kms/v1/keyversion/thriftuser_key%400/_eek?eek_op=decrypt HTTP/1.1" 403 410
4, But when we read the encrypted data (the directory referenced by the Hive metadata) directly with an active ticket for the same principal, the data could be read successfully:
[spark@xxx] hdfs dfs -cat /user/thriftuser/test.txt
test!
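For item 2, a minimal way to confirm that new credential files keep appearing in the staging directory (the application ID below is a placeholder, exactly as in the log above):

# List the Spark staging directory of the Thrift Server application;
# a new credentials-* file should appear after each scheduled refresh.
hdfs dfs -ls hdfs://cluster/user/thriftuser/.sparkStaging/application_xxx/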
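And for item 3, a rough sketch of how KMS access could be tested directly as the same principal, outside the Thrift Server (the KMS provider URI below is a placeholder for our actual kms-site value):

# Re-authenticate as the Thrift principal and talk to KMS directly;
# if this lists the keys, the principal itself can still reach KMS.
kinit -kt /keytab/spark_thrift.keytab thriftuser/thrift.server.org@THRIFT.REALMS.ORG
hadoop key list -provider kms://https@kms.server.org:9393/kms -metadata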
Thank you!