
Spark 2 accessing HDFS via Knox


In my company we secure HDFS with Knox for external services, so when I am developing locally I have to go through Knox to fetch HDFS files. In our Spark 1.6 projects we had to implement a custom FileSystem to wrap this Knox access. Now that I am starting a new project on Spark 2.1, I was wondering whether there is an easier way to fetch HDFS data without implementing a custom file system. What's the right way to do this?
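To make the setup concrete, this is roughly the kind of call our wrapper performs under the hood: fetching a file through the WebHDFS API that Knox proxies. A minimal sketch, assuming basic auth and a trusted TLS certificate; the gateway host, topology name ("default"), file path, and credentials below are placeholders:

import java.net.{HttpURLConnection, URL}
import java.util.Base64
import scala.io.Source

// Knox proxies WebHDFS under /gateway/<topology>/webhdfs/v1
val url = new URL("https://knox.example.com:8443/gateway/default/webhdfs/v1/data/input.csv?op=OPEN")
val auth = Base64.getEncoder.encodeToString("myuser:mypassword".getBytes("UTF-8"))

val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestProperty("Authorization", s"Basic $auth")

// op=OPEN streams the file content back; Knox rewrites the usual
// WebHDFS datanode redirect so the client stays on the gateway
val body = Source.fromInputStream(conn.getInputStream).mkString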

3 REPLIES

Re: Spark 2 accessing HDFS via Knox

@Bemjamin Quintino AFAIK there is no out-of-the-box solution for using HDFS over Knox from Spark. I'm not sure if this will be helpful, but if the issue is firewall restrictions, perhaps you could run your Spark code from an edge node (pushing changes with source control like git), or use a local Docker HDP cluster to develop and test without having to go over Knox. By the way, I think the solution you describe is very clever!


Re: Spark 2 accessing HDFS via Knox

@Bemjamin Quintino

Spark's Hive integration should be transparent even when using Kerberos. Out of the box, the Ambari-installed Spark client conf dir contains a trimmed-down hive-site.xml with the address of the Hive metastore, and the Spark client (spark-submit or spark-shell) automatically acquires a delegation token to communicate with the metastore. The error above seems to be caused by incorrect jars being used; check that you are not bundling or adding extra jars that could cause this conflict. Also, are you using the Hortonworks repositories to build your application?

Check these videos if not:

https://community.hortonworks.com/articles/146583/how-to-setup-hortonworks-repository-for-spark-on-i...

https://community.hortonworks.com/articles/147787/how-to-setup-hortonworks-repository-for-spark-on-e...
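As a rough sketch of the sbt side (the repository URL is the standard Hortonworks one; the HDP version suffix below is only an example, match it to your cluster's release):

// build.sbt: pull Spark artifacts from the Hortonworks repo so they
// match the cluster's HDP build
resolvers += "Hortonworks Releases" at "https://repo.hortonworks.com/content/repositories/releases/"

libraryDependencies ++= Seq(
  // HDP versions Spark as <spark-version>.<hdp-build>, e.g. for HDP 2.6.2:
  "org.apache.spark" %% "spark-sql"  % "2.1.1.2.6.2.0-205" % "provided",
  "org.apache.spark" %% "spark-hive" % "2.1.1.2.6.2.0-205" % "provided"
)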

HTH

*** If this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.


Re: Spark 2 accessing HDFS via Knox


Hello @Felix Albani, I tried hard to get access that way, without success, so now I am trying a new approach: the services are protected by Kerberos, so why not use kinit? With that I was able to connect to the metastore, but I am getting the following error:

Caused by: java.lang.LinkageError: ClassCastException: attempting to cast jar:file:/C:/Users/bquintin070317/.ivy2/cache/javax.ws.rs/javax.ws.rs-api/jars/javax.ws.rs-api-2.0.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class to jar:file:/C:/Users/bquintin070317/.ivy2/cache/javax.ws.rs/javax.ws.rs-api/jars/javax.ws.rs-api-2.0.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class

I saw a post in this forum saying that it should work once you add this:

hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.setConf("yarn.timeline-service.enabled", "false")

In my case I am using Spark 2 and I didn't find a way to do this. Any idea? What I did was:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local")
  .appName("firmMappingReader")
  .enableHiveSupport()
  .config("yarn.timeline-service.enabled", "false")
  .getOrCreate()

// even forced it here:
spark.sqlContext.setConf("yarn.timeline-service.enabled", "false")
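One thing that may be worth checking (a sketch, not verified against this setup): in Spark 2, keys passed to .config() without the spark.hadoop. prefix are treated as Spark properties and may never reach the underlying Hadoop Configuration. Two ways to push the property through:

// prefix the key so Spark copies it into the Hadoop Configuration it builds
val sparkWithPrefix = SparkSession
  .builder()
  .master("local")
  .appName("firmMappingReader")
  .enableHiveSupport()
  .config("spark.hadoop.yarn.timeline-service.enabled", "false")
  .getOrCreate()

// or set it on the live Hadoop Configuration, before any Hive access
sparkWithPrefix.sparkContext.hadoopConfiguration.set("yarn.timeline-service.enabled", "false")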