Created on 02-19-2020 07:24 PM - last edited on 02-19-2020 10:26 PM by VidyaSargur
I'm looking at the Spark parcel that is included with CDH 6.3, and seeing something strange. Specifically, the source code under python/pyspark doesn't seem to match up with what you see in the upstream (Apache) repository.
Take one file as an example.
/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py
Jumping to line 114 of this file, we see:
self._callsite = first_spark_call() or CallSite(None, None, None)
if gateway is not None and gateway.gateway_parameters.auth_token is None:
    raise ValueError(
        "You are trying to pass an insecure Py4j gateway to Spark. This"
        " is not allowed as it is a security risk.")
Now, this file should come from Spark 2.4.0 (see both /opt/cloudera/parcels/CDH/lib/spark/RELEASE and /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/version.py files, which indicate this). However, if you look at the upstream source, at the same line number, these lines don't exist. In fact, the source looks quite different. It has no such error referring to the insecure Py4j gateway.
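One way to make the comparison less dependent on line numbers is to search both copies of the file for the error text. The following is only a rough sketch: the helper names find_lines and report are made up for illustration, and the search string is simply a fragment of the message quoted above.

```python
def find_lines(lines, needle):
    """Return the 1-based numbers of the lines that contain `needle`."""
    return [number for number, line in enumerate(lines, start=1) if needle in line]


def report(path, needle="insecure Py4j gateway"):
    """Print where `needle` occurs in the file at `path` (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        hits = find_lines(handle, needle)
    print(f"{path}: {hits if hits else 'no match'}")
    return hits
```

Calling report() once on /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py and once on a local checkout of the upstream file would show whether, and where, each copy contains the check.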
Can anyone explain this discrepancy?
Created 02-19-2020 08:53 PM
@JeffEvans You are right. In CDH we cherry-pick JIRAs to include in our Spark, so not every feature available upstream is expected to be present in CDH Spark. The line you quoted was added in this JIRA https://issues.apache.org/jira/browse/SPARK-1087 and is not back-ported to our Spark code base. This is one of the reasons we state the following in our documentation:
Although this document makes some references to the external Spark site, not all the features, components, recommendations, and so on are applicable to Spark when used on CDH. Always cross-check the Cloudera documentation before building a reliance on some aspect of Spark that might not be supported or recommended by Cloudera.
Hope this clarifies.
Created 02-20-2020 12:28 PM
Thanks. Any chance you can point me to the Cloudera-specific documentation on dealing with this error?
You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk.
In the upstream Spark, there is an environment variable that can be set to mitigate this problem. However, it's not clear how to properly fix this in the Cloudera version. Thanks.
Created 02-20-2020 10:07 PM
May I know the exact steps you followed to reproduce the issue? Do you see this error when running any code snippet? Could you share a shorter version of the script so that I can reproduce it on my side and evaluate further?
Created 02-21-2020 07:51 AM
First, thanks for following up. I sincerely appreciate the response.
I figured out what's happening. To provide some context, I was trying to invoke some PySpark code from a Scala application, and getting the error message about the insecure gateway. It turns out this is relatively easy to fix; I just couldn't find any documentation talking about it anywhere. The driver application (which in my case is the Scala application) needs to generate or provide some type of shared secret, which Py4J refers to as the auth_token. Then it needs to provide that to the Python code, so that when the PySpark application needs to communicate through the Py4J gateway, it can use the same auth_token as a means of authentication.
There is a very short code example of this on the Py4J website. It shows two snippets, one for the Java/Scala side (when creating the gateway server), and one for the Python side (when connecting to the gateway). Notice that both snippets refer to the same auth_token, and hence the authentication will succeed over a "secure" gateway. So essentially, what you need to do is: generate a secret in the driver application, pass it as the auth_token when creating the Py4J gateway server, hand the same value to the Python process, and use it as the auth_token when the Python code connects to the gateway.
This should set up the Py4J gateway in a secure manner and avoid the error (and also the need for PYSPARK_ALLOW_INSECURE_GATEWAY).
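For the Python side, a minimal sketch might look like the following. It assumes the driver hands the secret over in an environment variable; the variable name PY4J_AUTH_TOKEN and the helper names are made up for illustration and are not part of Spark or Py4J. The only Py4J calls used are JavaGateway and GatewayParameters with its auth_token argument.

```python
import os
import secrets


def generate_auth_token(num_bytes=32):
    """Create a random shared secret (in the setup above, the driver does this)."""
    return secrets.token_hex(num_bytes)


def read_auth_token(env=os.environ, var_name="PY4J_AUTH_TOKEN"):
    """Read the secret handed over by the driver.

    The variable name is an assumption for this sketch, not a Spark or Py4J setting.
    """
    token = env.get(var_name)
    if not token:
        raise ValueError(f"{var_name} is not set; refusing to connect without a token")
    return token


def connect_gateway(port, auth_token):
    """Connect to an already running Py4J gateway server using the shared secret."""
    # Imported lazily so the helpers above work without py4j installed.
    from py4j.java_gateway import GatewayParameters, JavaGateway

    return JavaGateway(
        gateway_parameters=GatewayParameters(
            port=port, auth_token=auth_token, auto_convert=True
        )
    )
```

The gateway returned by connect_gateway() carries a non-empty gateway_parameters.auth_token, which is exactly what the check in context.py looks at, so passing it to SparkContext through its gateway argument should no longer raise the error. The Java/Scala side has to build its gateway server with the same token.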
Created 02-21-2020 09:11 AM
Thanks for the awesome explanation! This comment from the Spark JIRA explains the reason for allowing an insecure connection: https://issues.apache.org/jira/browse/SPARK-26019?focusedCommentId=16719231&page=com.atlassian.jira....