
Spark Thrift Server gets a GeneralSecurityException-related error


Hi,

On the first day, we started the Spark Thrift Server with a keytab file and principal; we could use Beeline to connect to the database and read data from tables.

[spark@xxx] $SPARK_HOME/sbin/start-thriftserver.sh --master yarn-client \
  --keytab /keytab/spark_thrift.keytab \
  --principal thriftuser/thrift.server.org@THRIFT.REALMS.ORG \
  --hiveconf hive.server2.thrift.port=10102 \
  --conf spark.hadoop.fs.hdfs.impl.disable.cache=true \
  --hiveconf hive.server2.authentication.kerberos.principal=thriftuser/thrift.server.org@THRIFT.REALMS.ORG \
  --hiveconf hive.server2.authentication.kerberos.keytab=/keytab/spark_thrift.keytab \
  --hiveconf hive.server2.logging.operation.enabled=true
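A quick sanity check before (re)starting the server is to confirm that the keytab actually contains the principal passed via --principal; for example, with the paths from the command above:

# List the entries stored in the keytab; the principal above should appear here.
klist -kt /keytab/spark_thrift.keytab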

We renewed the Kerberos ticket for this principal every 18 hours with the following loop:

while true; do
  kinit -kt /keytab/spark_thrift.keytab thriftuser/thrift.server.org@THRIFT.REALMS.ORG
  sleep 18h
done &
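To verify that this loop really refreshes the ticket, the ticket cache of the user running it can be inspected after each kinit; for example:

# Show the current ticket cache; the "Valid starting" and "Expires" times
# should move forward after every kinit iteration.
klist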

On the first day after starting the Spark Thrift Server, Beeline worked normally.

[spark@xxx] beeline
beeline> !connect jdbc:hive2://hive.server.org:10102/database;principal=thriftuser/thrift.server.org@THRIFT.REALMS.ORG

beeline> select count(1) from table;

### This returned the row count as expected.

But about a day later, when we tried to query the data again, it threw errors like:

java.lang.ClassCastException: org.apache.hadoop.security.authentication.client.AuthenticationException cannot be cast to java.security.GeneralSecurityException
        at org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:189)
        at org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388)
        at org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1381)
        at org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:1451)
        at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:305)
        at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:252)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:251)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

What we've checked:

1. We renewed the Kerberos ticket for the principal every 18 hours, as shown in the loop above.

2. We checked the Spark Thrift Server log and found that the credentials stored in HDFS were refreshed roughly every 16 hours, per the scheduled refresh interval in the log below.

18/09/19 16:45:58 INFO Client: Credentials file set to: credentials-xxxxx
18/09/19 16:45:59 INFO Client: To enable the AM to login from keytab, credentials are being copied over to the AM via the YARN secure Distributed Cache.
18/09/19 16:46:10 INFO CredentialUpdater: Scheduling credentials refresh from HDFS in 57588753ms.
18/09/20 08:45:58 INFO CredentialUpdater: Reading new credentials from hdfs://cluster/user/thriftuser/.sparkStaging/application_xxx/credentials-xxyyx
18/09/20 08:45:58 INFO CredentialUpdater: Credentials updated from credentials files.
18/09/20 08:45:58 INFO CredentialUpdater: Scheduling credentials refresh from HDFS in 57588700ms.
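For reference, the scheduled refresh interval works out to roughly 16 hours, which is consistent with the gap between the 09/19 16:46 scheduling entry and the 09/20 08:45 refresh entries above:

57588753 ms / 3600000 ms per hour ≈ 16.0 hours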

3. We checked the Ranger KMS access log and found a 403 error code on the decrypt (_eek?eek_op=decrypt) request.

xxx.xxx.xxx.xxx - - [20/Sep/2018:10:57:50 +0800] "POST /kms/v1/keyversion/thriftuser_key%400/_eek?eek_op=decrypt HTTP/1.1" 403 410
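To check whether the same principal can reach KMS at all outside of the Thrift Server (i.e. with a fresh ticket rather than with the delegation token Spark holds), something along these lines can be tried; the key name thriftuser_key is taken from the access log entry above:

# Get a fresh ticket for the principal
kinit -kt /keytab/spark_thrift.keytab thriftuser/thrift.server.org@THRIFT.REALMS.ORG
# List the keys visible through the configured KMS provider
hadoop key list
# Show key metadata (thriftuser_key should appear if the principal has access)
hadoop key list -metadata
# List encryption zones (may require HDFS admin privileges)
hdfs crypto -listZones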

4. But when we read the encrypted data directory (where the Hive data is stored) directly, while the ticket for the principal was still valid, the data could be read successfully:

[spark@xxx] hdfs dfs -cat /user/thriftuser/test.txt

test!
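It may also be worth confirming that this test file really sits inside an encryption zone, so that the manual read above exercises the same KMS decrypt path as the failing query; on reasonably recent Hadoop releases:

# Prints the encryption info (key name, key version, EDEK) if the file is in an
# encryption zone; assumes this subcommand is available in the Hadoop version in use.
hdfs crypto -getFileEncryptionInfo -path /user/thriftuser/test.txt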
1 REPLY


thank you!
