PySpark source code bundled with CDH 6.3 Spark doesn't match Apache source

Explorer

I'm looking at the Spark parcel that is included with CDH 6.3, and seeing something strange.  Specifically, the source code under python/pyspark doesn't seem to match up with what you see in the upstream (Apache) repository.

 

Take one file as an example.

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py

 

Jumping to line 114 of this file, we see:

 

 

        self._callsite = first_spark_call() or CallSite(None, None, None)
        if gateway is not None and gateway.gateway_parameters.auth_token is None:
            raise ValueError(
                "You are trying to pass an insecure Py4j gateway to Spark. This"
                " is not allowed as it is a security risk.")

 

 

Now, this file should come from Spark 2.4.0 (see both /opt/cloudera/parcels/CDH/lib/spark/RELEASE and /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/version.py files, which indicate this).  However, if you look at the upstream source, at the same line number, these lines don't exist.  In fact, the source looks quite different.  It has no such error referring to the insecure Py4j gateway.
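For anyone who wants to run the same check on their own installation, here is a minimal sketch that reports which lines of a given context.py contain the error text quoted above. The names `MARKER` and `find_marker_lines` are made up for illustration, and the path is simply the one from this post, so adjust it as needed.

```python
from pathlib import Path

# Fragment of the error message raised by the gateway check quoted above.
MARKER = "insecure Py4j gateway"


def find_marker_lines(source_path, marker=MARKER):
    """Return the 1-based line numbers in source_path that contain marker."""
    text = Path(source_path).read_text(encoding="utf-8")
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]


if __name__ == "__main__":
    # Path taken from the post; it only exists on a CDH host.
    path = "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py"
    if Path(path).exists():
        print(find_marker_lines(path))
    else:
        print("context.py not found at", path)
```

An empty list means the file has no such check. Running the same function against a checkout of the upstream source at the matching release tag gives a direct comparison.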

 

Can anyone explain this discrepancy?

1 ACCEPTED SOLUTION

Master Collaborator

@JeffEvans You are right. In CDH we cherry-pick JIRAs to include in our Spark, so not all features available upstream are expected to be present in CDH Spark. The lines you quoted were added in this JIRA https://issues.apache.org/jira/browse/SPARK-1087 and is not back-ported to our Spark code base. This is one of the reasons we include the following note in our documentation:

Although this document makes some references to the external Spark site, not all the features, components, recommendations, and so on are applicable to Spark when used on CDH. Always cross-check the Cloudera documentation before building a reliance on some aspect of Spark that might not be supported or recommended by Cloudera. 

Hope this clarifies. 

REPLIES

Explorer

Thanks.  Any chance you can point me to the Cloudera-specific documentation on dealing with this error?

 

You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk.

 

In upstream Spark, there is an environment variable (PYSPARK_ALLOW_INSECURE_GATEWAY) that can be set to work around this problem.  However, it's not clear how to properly fix this in the Cloudera version.  Thanks.

Master Collaborator

May I know the exact steps you followed to reproduce the issue? Do you see this error when running any code snippet? Could you share a shorter version of the script so that I can reproduce it on my side and evaluate further?

Explorer

First, thanks for following up.  I sincerely appreciate the response.

 

I figured out what's happening.  To provide some context, I was trying to invoke some PySpark code from a Scala application, and getting the error message about the insecure gateway.  It turns out this is relatively easy to fix; I just couldn't find any documentation about it anywhere.  The driver application (in my case, the Scala application) needs to generate or provide a shared secret, which Py4J refers to as the auth_token.  It then needs to pass that secret to the Python code, so that when the PySpark application communicates through the Py4J gateway, it can authenticate with the same auth_token.

 

There is a very short code example of this on the Py4J website.  It shows two snippets, one for the Java/Scala side (when creating the gateway server), and one for the Python side (when connecting to the gateway).  Notice that both snippets refer to the same auth_token, and hence the authentication will succeed over a "secure" gateway.  So essentially, what you need to do is:

  • Generate or obtain some type of shared secret string to use as the auth_token
  • In the Scala/Java side, pass this auth_token as a parameter when creating the GatewayServer
  • When invoking the Python code, pass the auth_token to it in some secure manner
  • In the Python code, use the auth_token that was passed as a parameter in the GatewayParameters

This should set up the Py4J gateway in a secure manner and avoid the error (and also the need for PYSPARK_ALLOW_INSECURE_GATEWAY).
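The Python half of those steps can be sketched roughly as follows. This is only an illustration under some assumptions: the environment variable name `MY_PY4J_AUTH_TOKEN` and the three helper functions are made up for this example, an environment variable is just one possible way to hand the token over, and `connect_gateway` expects a GatewayServer that was already started on the JVM side with the same token. `GatewayParameters(auth_token=...)` needs a Py4J version that supports auth tokens.

```python
import os
import secrets


def generate_auth_token():
    """Return a random URL-safe string usable as a shared secret."""
    return secrets.token_urlsafe(32)


def read_auth_token(env_var="MY_PY4J_AUTH_TOKEN"):
    """Read the token the driver placed in this process's environment.

    The variable name is hypothetical; use whatever channel the driver
    application actually uses to pass the secret.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(env_var + " is not set; refusing to connect without a token")
    return token


def connect_gateway(port, auth_token):
    """Connect to an already-running GatewayServer using the shared token."""
    # Imported here so the helpers above can be used without py4j installed.
    from py4j.java_gateway import GatewayParameters, JavaGateway

    return JavaGateway(
        gateway_parameters=GatewayParameters(port=port, auth_token=auth_token)
    )
```

A gateway created this way has a non-empty `gateway_parameters.auth_token`, which is exactly what the check quoted in the original question looks at, so passing it as the `gateway` argument to `SparkContext` should no longer raise the ValueError.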

Master Collaborator

Thanks for the awesome explanation! This comment on the Spark JIRA explains the reason for allowing an insecure connection: https://issues.apache.org/jira/browse/SPARK-26019?focusedCommentId=16719231&page=com.atlassian.jira....