I managed to set up and configure Spark 2.1 with CDH 5.9 (pointing to the Hadoop configuration directories), but I can't find which specific settings I should change in spark-env.sh to be able to access a Kerberized HDFS.
I tried to launch a shell with:
spark-shell --master=spark://IP:PORT --keytab <path_to_keytab> --principal <principal@REALM>
But trying to read the files fails because Spark cannot connect to the NameNode; obviously the NameNode requires a token...
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "cl1deb03dn/10.0.0.6"; destination host is: "cl1deb01nn.lab.hadoop.cloudapp.net":8020;
Should I create specific principals for the master and the slaves in the KDC? If yes, where should I place the keytabs, and where do I configure which principal and keytab file to use?
Instead of setting properties in spark-env.sh, you may want to set the HADOOP_CONF_DIR environment variable to point to the configuration files for the NameNode and YARN. Cloudera Manager can manage these configuration files and distribute them to servers configured as gateway nodes.
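As a rough sketch, the environment setup could look like the snippet below. The path /etc/hadoop/conf is an assumption here (it is a common location for client configuration deployed by Cloudera Manager); substitute whatever directory actually contains your cluster's core-site.xml and hdfs-site.xml:

```shell
# In spark-env.sh on each node (or the submitting gateway node):
# point Spark at the Hadoop client configuration so it picks up
# core-site.xml / hdfs-site.xml, which declare Kerberos as the
# authentication mechanism and name the NameNode principal.
# NOTE: /etc/hadoop/conf is an assumed path; adjust for your setup.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
```

With HADOOP_CONF_DIR set, the Hadoop client libraries that Spark uses read hadoop.security.authentication=kerberos from core-site.xml, which is what lets the client negotiate Kerberos with the NameNode instead of failing with the "Client cannot authenticate via:[TOKEN, KERBEROS]" error shown above.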