New Contributor
Posts: 4
Registered: ‎10-19-2016
Accepted Solution

PySpark + YARN + Kerberos = Chaos?

Hi folks,

We have Cloudera Enterprise edition configured on our servers (YARN, Spark History Server and the usual suspects). I'm able to run Spark jobs and connect to Hive using the Kerberos credentials on the edge node by simply typing `pyspark`.

Now here is the catch: I can't find a tutorial or code snippet that shows how to run a standalone Python script from a Windows client machine, especially once Kerberos and YARN are thrown into the mix. Pretty much all code snippets show:

from pyspark import SparkConf, SparkContext, HiveContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
# Do stuff

None of the snippets I've found includes Kerberos authentication code or shows how the Hive parameters are configured. Could someone please provide a snippet that lets me submit Hive queries to a Spark cluster on YARN with Kerberos authentication enabled?

Cloudera Employee
Posts: 94
Registered: ‎05-10-2016

Re: PySpark + YARN + Kerberos = Chaos?

You will need to have Spark authenticate via Kerberos. This can be done by specifying the correct properties on the command line: https://www.cloudera.com/documentation/enterprise/5-7-x/topics/sg_spark_auth.html
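
For illustration only, here is a rough sketch of what passing Kerberos settings on the command line might look like when the job is launched from Python. It assumes the relevant options are spark-submit's `--principal` and `--keytab` flags for YARN; the helper name `build_spark_submit_cmd`, the script name, the principal and the keytab path are all hypothetical placeholders, not values from this thread.

```python
def build_spark_submit_cmd(script, principal, keytab, deploy_mode="client"):
    """Assemble a spark-submit argument list for a Kerberized YARN cluster.

    Hypothetical helper: the principal and keytab values are placeholders
    that must be replaced with the ones issued for your own cluster.
    """
    return [
        "spark-submit",
        "--master", "yarn",            # run on YARN rather than "local"
        "--deploy-mode", deploy_mode,  # "client" or "cluster"
        "--principal", principal,      # Kerberos principal to log in as
        "--keytab", keytab,            # keytab file holding that principal's key
        script,
    ]


# Example (not executed here):
#   import subprocess
#   cmd = build_spark_submit_cmd("my_job.py", "user@EXAMPLE.COM", "/path/to/user.keytab")
#   subprocess.call(cmd)
```

Building the command as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.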

New Contributor
Posts: 4
Registered: ‎10-19-2016

Re: PySpark + YARN + Kerberos = Chaos?

Thanks for the reply; your solution works too.

In my case, the problem was solved simply by having an active Kerberos session and running the Spark job with spark-submit; no additional properties were required.
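
In case it helps anyone else, the approach above could be wrapped in a small pre-flight check. This is only a sketch under assumptions: it relies on `klist -s` (which in MIT Kerberos exits with status 0 when the ticket cache is valid; the Windows `klist` behaves differently), it passes `--master yarn` explicitly although a cluster default may make that unnecessary, and the function names and the script name are hypothetical.

```python
import subprocess


def has_kerberos_ticket(klist="klist"):
    """Return True if `klist -s` reports a readable, unexpired ticket cache."""
    try:
        return subprocess.call([klist, "-s"]) == 0
    except OSError:
        # klist is not installed or not on PATH
        return False


def run_job(script, spark_submit="spark-submit", klist="klist"):
    """Launch `script` on YARN, relying only on the existing ticket cache."""
    if not has_kerberos_ticket(klist):
        raise RuntimeError("No valid Kerberos ticket found; run kinit first")
    # No Kerberos-specific options are passed; the active session is used.
    return subprocess.call([spark_submit, "--master", "yarn", script])


# Example (not executed here), after running `kinit` in the same session:
#   run_job("my_job.py")
```

Failing early with a clear message is usually easier to debug than the authentication errors that surface later inside the YARN client.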
