
Spark access remote HDFS in cross realm trust setup


Master Collaborator


In a two-cluster environment, where each cluster has its own KDC and a cross-realm trust is configured between the KDCs, I cannot read data via Spark. Am I missing some property for spark-shell or spark-submit?


Local HDFS: devhanameservice

Remote HDFS: hanameservice


Running an hdfs dfs -ls from dev to list the prod cluster works fine:

[centos@<dev-gateway> ~]$ hdfs dfs -ls hdfs://hanameservice/tmp
Found 6 items
d---------   - hdfs   supergroup          0 2019-03-14 11:47 hdfs://hanameservice/tmp/.cloudera_health_monitoring_canary_files

But trying to read a file on the remote HDFS from spark-shell returns this:

[centos@<dev-gateway> ~]$ spark2-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://<dev-gateway>.eu-west-1.compute.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1552545238536_0261).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0.cloudera4

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val t = sc.textFile("hdfs://hanameservice/tmp/external/test/file.csv")
t: org.apache.spark.rdd.RDD[String] = hdfs://hanameservice/tmp/external/test/file.csv MapPartitionsRDD[1] at textFile at <console>:24

scala> t.count()
[Stage 0:>                                                         (0 + 1) / 28]19/03/14 11:45:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, <worker-node>, executor 28): Failed on local exception: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "<worker-node>/"; destination host is: "<remote-name-node>":8020;
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(
        at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
        at java.lang.reflect.Method.invoke(
        at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(
        at org.apache.hadoop.hdfs.DFSInputStream.<init>(
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
        at org.apache.hadoop.mapred.LineRecordReader.<init>(
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(
        at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.executor.Executor$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$
Caused by: Client cannot authenticate via:[TOKEN, KERBEROS]

I am able to run MapReduce jobs with this property:


Is there a property I should add to the Spark settings? And if yes, how?
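From reading the docs, I suspect the issue is delegation tokens: the executors authenticate to HDFS with tokens rather than Kerberos tickets, and Spark on YARN only fetches tokens for the local nameservice unless told otherwise. The Spark-on-YARN analogue of the MapReduce property seems to be spark.yarn.access.hadoopFileSystems (called spark.yarn.access.namenodes before Spark 2.3), which makes the driver obtain delegation tokens for the listed filesystems at submission time. Would something like this sketch be the right approach?

```shell
# Hypothetical invocation: ask Spark to also obtain a delegation
# token for the remote nameservice when the application starts.
spark2-shell --conf spark.yarn.access.hadoopFileSystems=hdfs://hanameservice
```

I assume the same --conf would work for spark-submit, or could go into spark-defaults.conf for the whole service.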



Re: Spark access remote HDFS in cross realm trust setup

Master Collaborator
Any hints, Cloudera folks? Thanks