Contributor
Posts: 52
Registered: ‎10-19-2016

Connect to a Kerberized cluster from PySpark on a Windows machine

Hi, 

 

I would like to use PySpark with PyCharm on a Windows machine. I do the following:

 

1. Run kinit using MIT Kerberos for Windows and confirm that the ticket is for the proper principal.

2. Download the Hive config files from Cloudera Manager and export their location as "HADOOP_CONF_DIR".

3. Set up %JAVA_HOME%\jre\bin\security\krb5.ini with the proper realm and KDC.

4. Start the PyCharm Python console.

5. Run the following script:

   

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Pyspark Demo") \
    .config("spark.authenticate", "true") \
    .config("spark.authenticate.enableSaslEncryption", "true") \
    .config("spark.sql.warehouse.dir", "hdfs://nameservice1/user/hive/warehouse") \
    .getOrCreate()
print("Spark version: {}".format(spark.version))
df = spark.sql("SHOW DATABASES")
df.show()

 

Here's my problem: no matter which principal I kinit as, df.show() always returns only the default database. It seems PySpark is using the wrong Kerberos principal?

 

Thanks.

 

Cloudera Employee
Posts: 97
Registered: ‎05-10-2016

Re: Connect to a Kerberized cluster from PySpark on a Windows machine

The Kerberos principal doesn't affect which databases are shown. You are most likely seeing the databases available locally in the embedded Derby metastore. To view Hive databases and tables, you need to configure Spark to connect to Hive's metastore. You can do this by setting up the machine as a gateway within Cloudera Manager or by adding hive-site.xml to Spark's configuration.
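
As a rough way to check the hive-site.xml route, the sketch below looks for hive-site.xml in a config directory and reads hive.metastore.uris from it, then builds a session with Hive support enabled. The helper names find_metastore_uris and build_hive_session are made up for this illustration, and it assumes the downloaded client config actually defines hive.metastore.uris. A Kerberized metastore normally needs the other settings in hive-site.xml as well, so placing the whole file on Spark's configuration path is likely the more complete fix; passing the URI explicitly is only a quick test.

```python
import os
import xml.etree.ElementTree as ET


def find_metastore_uris(conf_dir):
    """Return hive.metastore.uris from conf_dir/hive-site.xml, or None if absent."""
    path = os.path.join(conf_dir, "hive-site.xml")
    if not os.path.isfile(path):
        return None
    root = ET.parse(path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "hive.metastore.uris":
            value = prop.findtext("value", default="").strip()
            return value or None
    return None


def build_hive_session(app_name, conf_dir):
    """Hypothetical helper: build a SparkSession that talks to the Hive metastore."""
    # Imported lazily so the config check above works without pyspark installed.
    from pyspark.sql import SparkSession

    uris = find_metastore_uris(conf_dir)
    if uris is None:
        raise RuntimeError("no hive.metastore.uris found in " + conf_dir)
    return (
        SparkSession.builder
        .appName(app_name)
        .config("hive.metastore.uris", uris)
        .enableHiveSupport()  # without this, Spark does not use the Hive catalog
        .getOrCreate()
    )
```

If find_metastore_uris(os.environ["HADOOP_CONF_DIR"]) returns None, the downloaded config directory does not contain a usable hive-site.xml, which would be consistent with only the default database showing up.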