<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: PySpark source code bundled with CDH 6.3 Spark doesn't match Apache source in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290251#M214770</link>
    <description>&lt;P&gt;Thanks for the awesome explanation! This comment on the Spark JIRA explains the reason for allowing an insecure connection:&amp;nbsp;&lt;A href="https://issues.apache.org/jira/browse/SPARK-26019?focusedCommentId=16719231&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16719231" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-26019?focusedCommentId=16719231&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16719231&lt;/A&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 21 Feb 2020 17:11:10 GMT</pubDate>
    <dc:creator>venkatsambath</dc:creator>
    <dc:date>2020-02-21T17:11:10Z</dc:date>
    <item>
      <title>PySpark source code bundled with CDH 6.3 Spark doesn't match Apache source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290127#M214691</link>
      <description>&lt;P&gt;I'm looking at the Spark parcel that is included with CDH 6.3, and seeing something strange.&amp;nbsp; Specifically, the source code under python/pyspark doesn't seem to match up with what you see in the upstream (Apache) repository.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Take one file as an example.&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier,monospace"&gt;/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Jumping to line 114 of this file, we see:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;        self._callsite = first_spark_call() or CallSite(None, None, None)
        if gateway is not None and gateway.gateway_parameters.auth_token is None:
            raise ValueError(
                "You are trying to pass an insecure Py4j gateway to Spark. This"
                " is not allowed as it is a security risk.")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now, this file should come from Spark 2.4.0 (see both&amp;nbsp;&lt;FONT face="courier new,courier,monospace"&gt;/opt/cloudera/parcels/CDH/lib/spark/RELEASE&lt;/FONT&gt; and&amp;nbsp;&lt;FONT face="courier new,courier,monospace"&gt;/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/version.py&lt;/FONT&gt; files, which indicate this).&amp;nbsp; However, if you look at the upstream source, &lt;A href="https://github.com/apache/spark/blob/v2.4.0/python/pyspark/context.py#L114" target="_self"&gt;at the same line number&lt;/A&gt;, these lines don't exist.&amp;nbsp; In fact, the source looks quite different.&amp;nbsp; It has no such error referring to the insecure Py4j gateway.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Can anyone explain this discrepancy?&lt;/P&gt;</description>
      <pubDate>Thu, 20 Feb 2020 06:26:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290127#M214691</guid>
      <dc:creator>JeffEvans</dc:creator>
      <dc:date>2020-02-20T06:26:35Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark source code bundled with CDH 6.3 Spark doesn't match Apache source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290130#M214694</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/33683"&gt;@JeffEvans&lt;/a&gt;&amp;nbsp;You are right. In CDH we cherry-pick JIRAs to include in our Spark build, so not all features available upstream are expected to be present in CDH Spark. The &lt;A href="https://github.com/apache/spark/blame/0a4c03f7d084f1d2aa48673b99f3b9496893ce8d/python/pyspark/context.py#L114" target="_self"&gt;line&lt;/A&gt; you quoted was added in this JIRA&amp;nbsp;&lt;A href="https://issues.apache.org/jira/browse/SPARK-1087" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-1087&lt;/A&gt;&amp;nbsp;and is not back-ported to our Spark code base. This is one of the reasons we state the following in our &lt;A href="https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/spark.html" target="_self"&gt;documentation&lt;/A&gt;:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Although this document makes some references to the external Spark site, not all the features, components, recommendations, and so on are applicable to Spark when used on CDH. Always cross-check the Cloudera documentation before building a reliance on some aspect of Spark that might not be supported or recommended by Cloudera. &lt;/LI-CODE&gt;&lt;P&gt;Hope this clarifies.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 20 Feb 2020 04:53:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290130#M214694</guid>
      <dc:creator>venkatsambath</dc:creator>
      <dc:date>2020-02-20T04:53:03Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark source code bundled with CDH 6.3 Spark doesn't match Apache source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290179#M214729</link>
      <description>&lt;P&gt;Thanks.&amp;nbsp; Any chance you can point me to the Cloudera-specific documentation on dealing with this error?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In upstream Spark, there is an environment variable that can be set to work around this problem.&amp;nbsp; However, it's not clear how to properly fix this in the Cloudera version.&amp;nbsp; Thanks.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 20 Feb 2020 20:28:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290179#M214729</guid>
      <dc:creator>JeffEvans</dc:creator>
      <dc:date>2020-02-20T20:28:09Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark source code bundled with CDH 6.3 Spark doesn't match Apache source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290219#M214742</link>
      <description>&lt;P&gt;May I know the exact steps you followed to replicate the issue? Do you see this error when running any code snippet? Could you share a shorter version of the script so that I can reproduce it on my side and evaluate further?&lt;/P&gt;</description>
      <pubDate>Fri, 21 Feb 2020 06:07:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290219#M214742</guid>
      <dc:creator>venkatsambath</dc:creator>
      <dc:date>2020-02-21T06:07:16Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark source code bundled with CDH 6.3 Spark doesn't match Apache source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290246#M214765</link>
      <description>&lt;P&gt;First, thanks for following up.&amp;nbsp; I sincerely appreciate the response.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I figured out what's happening.&amp;nbsp; To provide some context, I was trying to invoke some PySpark code from a Scala application, and getting the error message about the insecure gateway.&amp;nbsp; It turns out this is relatively easy to fix; I just couldn't find any documentation talking about it anywhere.&amp;nbsp; The driver application (which in my case is the Scala application) needs to generate or provide some type of shared secret, which Py4J refers to as the auth_token.&amp;nbsp; Then it needs to provide that to the Python code, so that when the PySpark application needs to communicate through the Py4J gateway, it can use the same auth_token as a means of authentication.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There is a very short code example of this on the &lt;A href="https://www.py4j.org/advanced_topics.html#authentication" target="_self"&gt;Py4J website&lt;/A&gt;.&amp;nbsp; It shows two snippets, one for the Java/Scala side (when creating the gateway server), and one for the Python side (when connecting to the gateway).&amp;nbsp; Notice that both snippets refer to the same auth_token, and hence the authentication will succeed over a "secure" gateway.&amp;nbsp; So essentially, what you need to do is:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Generate or obtain some type of shared secret string to use as the &lt;FONT face="courier new,courier,monospace"&gt;auth_token&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;On the Scala/Java side, pass this &lt;FONT face="courier new,courier,monospace"&gt;auth_token&lt;/FONT&gt; as a parameter when creating the &lt;FONT face="courier new,courier,monospace"&gt;GatewayServer&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;When invoking the Python code, pass the &lt;FONT face="courier new,courier,monospace"&gt;auth_token&lt;/FONT&gt; to it in some secure manner&lt;/LI&gt;&lt;LI&gt;In the Python code, use the &lt;FONT face="courier new,courier,monospace"&gt;auth_token&lt;/FONT&gt; that was passed as a parameter in the &lt;FONT face="courier new,courier,monospace"&gt;GatewayParameters&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This should set up the Py4J gateway in a secure manner and avoid the error (and also the need for&amp;nbsp;&lt;SPAN&gt;&lt;FONT face="courier new,courier,monospace"&gt;PYSPARK_ALLOW_INSECURE_GATEWAY&lt;/FONT&gt;).&lt;/SPAN&gt;&lt;/P&gt;</description>
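The Python side of the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the environment variable name MY_APP_PY4J_AUTH_TOKEN and the helper function names are hypothetical, and passing the token through the environment is just one possible way to hand it over. In the setup described above the JVM driver would create the token; generate_auth_token only shows what such a secret might look like. GatewayParameters with an auth_token argument is the Py4J call mentioned in the post.

```python
import os
import secrets

# Hypothetical variable name chosen for this sketch; not defined by Spark or Py4J.
TOKEN_ENV_VAR = "MY_APP_PY4J_AUTH_TOKEN"


def generate_auth_token(num_bytes=32):
    """Return a random hex string usable as a shared secret."""
    return secrets.token_hex(num_bytes)


def read_auth_token(environ=None):
    """Read the token handed over by the driver; fail loudly if it is missing."""
    environ = os.environ if environ is None else environ
    token = environ.get(TOKEN_ENV_VAR)
    if not token:
        raise RuntimeError(TOKEN_ENV_VAR + " is not set; no auth_token available")
    return token


def connect_gateway(port, auth_token):
    """Connect to an already running GatewayServer using the shared token."""
    # Imported lazily so the helpers above also work without py4j installed.
    from py4j.java_gateway import GatewayParameters, JavaGateway

    params = GatewayParameters(port=port, auth_token=auth_token, auto_convert=True)
    return JavaGateway(gateway_parameters=params)
```

A gateway created this way has a non-empty gateway_parameters.auth_token, which is the attribute inspected by the check quoted at the top of the thread, so it should be possible to pass it to SparkContext(gateway=...) without triggering the error. The GatewayServer on the Scala/Java side has to be created with the same token.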
      <pubDate>Fri, 21 Feb 2020 15:51:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290246#M214765</guid>
      <dc:creator>JeffEvans</dc:creator>
      <dc:date>2020-02-21T15:51:49Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark source code bundled with CDH 6.3 Spark doesn't match Apache source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290251#M214770</link>
      <description>&lt;P&gt;Thanks for the awesome explanation! This comment on the Spark JIRA explains the reason for allowing an insecure connection:&amp;nbsp;&lt;A href="https://issues.apache.org/jira/browse/SPARK-26019?focusedCommentId=16719231&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16719231" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-26019?focusedCommentId=16719231&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16719231&lt;/A&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 21 Feb 2020 17:11:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/PySpark-source-code-bundled-with-CDH-6-3-Spark-doesn-t-match/m-p/290251#M214770</guid>
      <dc:creator>venkatsambath</dc:creator>
      <dc:date>2020-02-21T17:11:10Z</dc:date>
    </item>
  </channel>
</rss>

