Member since 04-26-2019 · 12 Posts · 0 Kudos Received · 0 Solutions
02-21-2020 07:51 AM
First, thanks for following up. I sincerely appreciate the response. I figured out what's happening.

To provide some context, I was trying to invoke some PySpark code from a Scala application and getting the error message about the insecure gateway. It turns out this is relatively easy to fix; I just couldn't find any documentation about it anywhere. The driver application (which in my case is the Scala application) needs to generate or provide some type of shared secret, which Py4J refers to as the auth_token. It then needs to provide that secret to the Python code, so that when the PySpark application communicates through the Py4J gateway, it can use the same auth_token as a means of authentication.

There is a very short code example of this on the Py4J website. It shows two snippets, one for the Java/Scala side (when creating the gateway server) and one for the Python side (when connecting to the gateway). Notice that both snippets refer to the same auth_token, so the authentication succeeds over a "secure" gateway.

So essentially, what you need to do is:
1. Generate or obtain some type of shared secret string to use as the auth_token.
2. On the Scala/Java side, pass this auth_token as a parameter when creating the GatewayServer.
3. When invoking the Python code, pass the auth_token to it in some secure manner.
4. In the Python code, pass the auth_token as a parameter in the GatewayParameters.

This sets up the Py4J gateway securely and avoids the error (and also the need for PYSPARK_ALLOW_INSECURE_GATEWAY).
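The steps above can be sketched from the Python side. This is a minimal sketch, not CDH-specific; the environment variable name PY4J_AUTH_TOKEN is my own choice for passing the secret, not a Py4J convention, and the Scala side is only shown as a comment:

```python
import os
import secrets

# Step 1: generate a shared secret to use as the auth_token.
secret = secrets.token_urlsafe(32)

# Step 2 happens on the Scala/Java side when building the GatewayServer,
# roughly (Scala, shown as a comment; check the Py4J docs for your version):
#   new GatewayServer.GatewayServerBuilder(app).authToken(secret).build()

# Step 3: hand the secret to the Python process in some secure manner.
# Here an environment variable is used purely for illustration.
os.environ["PY4J_AUTH_TOKEN"] = secret

# Step 4: on the Python side, use the same token in the GatewayParameters.
# The py4j import is deferred so this sketch runs even without py4j installed.
def make_gateway_parameters(token):
    from py4j.java_gateway import GatewayParameters
    return GatewayParameters(auth_token=token)
```

Both sides must see the identical token string, otherwise the handshake fails just as if no token had been configured.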
02-20-2020 12:28 PM
Thanks. Any chance you can point me to the Cloudera-specific documentation on dealing with this error? The message reads: "You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk." In upstream Spark, there is an environment variable that can be set to mitigate this problem. However, it's not clear how to properly fix this in the Cloudera version. Thanks.
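For reference, the upstream workaround I'm referring to looks like this (a sketch only; it acknowledges the security risk rather than fixing it, and whether the Cloudera build honors this variable is exactly what I'm asking about):

```python
import os

# Upstream Spark 2.4.x consults this variable before rejecting an insecure
# Py4J gateway; it must be set before the SparkContext is created.
os.environ["PYSPARK_ALLOW_INSECURE_GATEWAY"] = "1"
# ... then create the SparkContext from the existing gateway as before.
```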
02-19-2020 07:24 PM
I'm looking at the Spark parcel included with CDH 6.3, and I'm seeing something strange. Specifically, the source code under python/pyspark doesn't match what's in the upstream (Apache) repository.
Take one file as an example.
/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py
Jumping to line 114 of this file, we see:
self._callsite = first_spark_call() or CallSite(None, None, None)
if gateway is not None and gateway.gateway_parameters.auth_token is None:
    raise ValueError(
        "You are trying to pass an insecure Py4j gateway to Spark. This"
        " is not allowed as it is a security risk.")
Now, this file should come from Spark 2.4.0 (both /opt/cloudera/parcels/CDH/lib/spark/RELEASE and /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/version.py indicate this). However, if you look at the upstream source at the same line number, these lines don't exist. In fact, the source looks quite different: it has no such error referring to the insecure Py4j gateway.
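One quick way to see whether a given installation carries this extra check is simply to search the file for the error string. A throwaway diagnostic sketch (the parcel path in the comment is the one quoted above):

```python
def has_insecure_gateway_check(context_py_path):
    """Return True if this pyspark context.py contains the
    insecure-gateway check, identified by its error message."""
    with open(context_py_path) as f:
        return "insecure Py4j gateway" in f.read()

# e.g. has_insecure_gateway_check(
#     "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py")
```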
Can anyone explain this discrepancy?
Labels:
- Apache Spark
11-01-2019 11:49 AM
In working with a particular HDP 3.1 cluster, with Spark 2.3 installed, I am finding that the Spark client libraries (e.g., the spark-cli command, as well as the libraries under jars) are not available on every node. They are only installed on the nodes the customer refers to as "client nodes" (I believe this is analogous to "edge nodes"). They also have data nodes in the cluster, which are able to run Spark executors (and, in fact, YARN does distribute tasks to executors on them), but those nodes do not have the Spark client libraries installed.
Is this a normal setup? Can I not assume that the Spark client is installed on every node, even if it is generally available on the cluster? Thanks for any insight.
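To answer this empirically on a given node, one can at least check whether the client launcher is visible. A minimal sketch; it only tests whether spark-submit is on the PATH, not whether the jars directory is complete:

```python
import shutil

def spark_client_installed():
    """True if the Spark client launcher is on this node's PATH."""
    return shutil.which("spark-submit") is not None
```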
... View more
Labels: