<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question PySpark + YARN + Kerberos = Chaos? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/46450#M43874</link>
    <description>&lt;P&gt;Hi folks,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have Cloudera Enterprise edition configured on our servers (YARN, Spark History Server and the usual suspects). I'm able to run Spark jobs and connect to Hive using the Kerberos credentials on the edge node by simply typing `pyspark`.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now here is the catch: there seems to be no tutorial or code snippet out there that shows how to run a standalone Python script on a client Windows box, especially when we throw Kerberos and YARN into the mix. Pretty much all code snippets show:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;from pyspark import SparkConf, SparkContext, HiveContext
conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
hc = HiveContext(sc)
# Do stuff&lt;/PRE&gt;&lt;P&gt;None of these snippets includes Kerberos authentication code or shows how the Hive parameters are configured. Could someone please provide a snippet that lets me submit Hive queries to a Spark cluster on YARN with Kerberos authentication enabled?&lt;/P&gt;</description>
    <pubDate>Wed, 19 Oct 2016 09:27:37 GMT</pubDate>
    <dc:creator>wahwah</dc:creator>
    <dc:date>2016-10-19T09:27:37Z</dc:date>
    <item>
      <title>PySpark + YARN + Kerberos = Chaos?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/46450#M43874</link>
      <description>&lt;P&gt;Hi folks,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have Cloudera Enterprise edition configured on our servers (YARN, Spark History Server and the usual suspects). I'm able to run Spark jobs and connect to Hive using the Kerberos credentials on the edge node by simply typing `pyspark`.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now here is the catch: there seems to be no tutorial or code snippet out there that shows how to run a standalone Python script on a client Windows box, especially when we throw Kerberos and YARN into the mix. Pretty much all code snippets show:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;from pyspark import SparkConf, SparkContext, HiveContext
conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
hc = HiveContext(sc)
# Do stuff&lt;/PRE&gt;&lt;P&gt;None of these snippets includes Kerberos authentication code or shows how the Hive parameters are configured. Could someone please provide a snippet that lets me submit Hive queries to a Spark cluster on YARN with Kerberos authentication enabled?&lt;/P&gt;</description>
      <pubDate>Wed, 19 Oct 2016 09:27:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/46450#M43874</guid>
      <dc:creator>wahwah</dc:creator>
      <dc:date>2016-10-19T09:27:37Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark + YARN + Kerberos = Chaos?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/46897#M43875</link>
      <description>&lt;P&gt;You will need to have Spark authenticate via Kerberos. This can be done by specifying the correct properties on the command line: &lt;A href="https://www.cloudera.com/documentation/enterprise/5-7-x/topics/sg_spark_auth.html" target="_blank"&gt;https://www.cloudera.com/documentation/enterprise/5-7-x/topics/sg_spark_auth.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Nov 2016 14:45:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/46897#M43875</guid>
      <dc:creator>hubbarja</dc:creator>
      <dc:date>2016-11-02T14:45:09Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark + YARN + Kerberos = Chaos?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/56452#M43876</link>
      <description>&lt;P&gt;Thanks for the reply; your solution works too.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In my case, the problem was solved simply by having an active Kerberos session and running the Spark job with spark-submit; no additional properties were required.&lt;/P&gt;</description>
      <pubDate>Sun, 25 Jun 2017 18:48:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/56452#M43876</guid>
      <dc:creator>wahwah</dc:creator>
      <dc:date>2017-06-25T18:48:45Z</dc:date>
    </item>
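The resolution reported above (an active Kerberos session plus spark-submit) might be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names and the script name hive_query.py are hypothetical, `klist -s` is the MIT Kerberos form of the ticket check (the Windows klist behaves differently), and the optional `--principal`/`--keytab` flags are the spark-submit options for YARN, which the thread reports were not needed once a ticket was active.

```python
import subprocess


def build_spark_submit_cmd(script_path, principal=None, keytab=None):
    """Return the spark-submit argument list for a YARN client-mode run."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "client"]
    # Assumption: principal/keytab are only passed when both are given;
    # with a ticket already obtained via kinit they can be left out.
    if principal and keytab:
        cmd += ["--principal", principal, "--keytab", keytab]
    cmd.append(script_path)
    return cmd


def has_kerberos_ticket():
    """Return True if `klist -s` reports a valid ticket cache (MIT Kerberos)."""
    try:
        return subprocess.call(["klist", "-s"]) == 0
    except OSError:
        # klist is not installed or not on PATH
        return False
```

A caller would typically check has_kerberos_ticket() first, run kinit if it returns False, and then pass build_spark_submit_cmd("hive_query.py") to subprocess.check_call. The submitted script itself would create its SparkContext without a hard-coded setMaster("local"), since the master comes from the command line.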
    <item>
      <title>Re: PySpark + YARN + Kerberos = Chaos?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/77622#M43877</link>
      <description>&lt;P&gt;Hello Experts,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am looking for sample Python code that can obtain a Kerberos ticket and impersonate a user from within the code to access WebHDFS or WebHCat. I found some Java examples such as &lt;A href="http://dewoods.com/blog/hadoop-kerberos-guide" target="_blank"&gt;http://dewoods.com/blog/hadoop-kerberos-guide&lt;/A&gt;, but I am looking for similar Python code.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The Python code below handles Kerberos but doesn't do impersonation:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED

# Negotiate (SPNEGO) authentication using the current Kerberos ticket cache
kerberos_auth = HTTPKerberosAuth(mutual_authentication=REQUIRED, sanitize_mutual_error_response=False)

webhdfs_url = "http://namenode:50070/webhdfs/v1/tmp?op=LISTSTATUS"
headers = {"X-Requested-By": "someuser"}

response = requests.get(webhdfs_url, headers=headers, auth=kerberos_auth, verify=False)

print("webhdfs response statuscode=", response.status_code)
print("webhdfs response responsetext=", response.text)&lt;/PRE&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jul 2018 21:57:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/77622#M43877</guid>
      <dc:creator>ebeb</dc:creator>
      <dc:date>2018-07-26T21:57:41Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark + YARN + Kerberos = Chaos?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/77653#M43878</link>
      <description>&lt;P&gt;As this question has already been marked resolved and you are looking for plain Python examples rather than PySpark, you may want to ask in a new question.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You may also want to look at the various Python libraries that already implement access to HDFS data.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Jul 2018 15:24:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/77653#M43878</guid>
      <dc:creator>hubbarja</dc:creator>
      <dc:date>2018-07-27T15:24:41Z</dc:date>
    </item>
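On the impersonation question raised above, one possible direction is the WebHDFS `doas` query parameter, which the WebHDFS REST API documents for proxy users. The sketch below only builds the URL; the function name and its defaults are hypothetical, the host and port are taken from the snippet in the question, and it assumes the authenticated Kerberos principal has been configured as a Hadoop proxy user (the hadoop.proxyuser.* settings in core-site.xml). Without that configuration the NameNode is expected to reject the request.

```python
from urllib.parse import urlencode


def webhdfs_url(namenode, path, op, doas=None, port=50070):
    """Build a WebHDFS REST URL, optionally impersonating the user `doas`."""
    params = {"op": op}
    if doas:
        # `doas` asks the NameNode to run the operation as another user;
        # it only works if the caller is an allowed proxy user.
        params["doas"] = doas
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        namenode, port, path, urlencode(params))
```

The resulting URL could then be passed to requests.get together with the HTTPKerberosAuth object from the question, so that Kerberos authenticates the real caller while `doas` names the impersonated user.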
    <item>
      <title>Re: PySpark + YARN + Kerberos = Chaos?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/78114#M43879</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/16433"&gt;@hubbarja&lt;/a&gt;&lt;/P&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I decided not to open a new topic, but I'm currently facing issues when trying to connect PySpark to HBase with Kerberos enabled.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following code works if I disable Kerberos in HBase:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;%pyspark

host = 'hostname'
tablename = 'Test:Test2'

conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": tablename}

keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)

hbase_rdd.collect()&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With Kerberos enabled, the following error is thrown:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=32, exceptions:
Mon Aug 06 11:36:55 UTC 2018, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68623: row 'Test:Test2,,00000000000000' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hostname,60020,1533550276857, seqNum=0&lt;/PRE&gt;&lt;P&gt;Best regards,&lt;/P&gt;&lt;P&gt;Gil Pinheiro&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 06 Aug 2018 11:47:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/78114#M43879</guid>
      <dc:creator>gmpinheiro</dc:creator>
      <dc:date>2018-08-06T11:47:02Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark + YARN + Kerberos = Chaos?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/285463#M43880</link>
      <description>&lt;P&gt;Any suggestions for the above request?&lt;/P&gt;</description>
      <pubDate>Thu, 12 Dec 2019 14:16:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-YARN-Kerberos-Chaos/m-p/285463#M43880</guid>
      <dc:creator>P_</dc:creator>
      <dc:date>2019-12-12T14:16:16Z</dc:date>
    </item>
  </channel>
</rss>

