PySpark in Hue - pass an insecure Py4j gateway to Spark
Labels: Apache Spark
Created on 07-19-2019 08:51 AM - edited 09-16-2022 07:31 AM
Hi,
I have spent a few days researching why we cannot execute any Python code in the PySpark interface inside Hue.
PySpark command:
from pyspark import SparkContext
Error Message:
stdout: stderr:
WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2) overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2/).
WARNING: Running spark-class from user-defined location.
19/07/19 07:59:45 WARN spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/07/19 07:59:46 WARN spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/07/19 07:59:46 WARN rsc.RSCConf: Your hostname, usbda04.unix.rgbk.com, resolves to a loopback address, but we couldn't find any external IP address!
19/07/19 07:59:46 WARN rsc.RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
19/07/19 07:59:49 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/07/19 07:59:49 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
YARN Diagnostics:
    sys.exit(main())
  File "/tmp/2588781570290623481", line 589, in main
    sc = SparkContext(jsc=jsc, gateway=gateway, conf=conf)
  File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 121, in __init__
ValueError: You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk.
YARN Diagnostics:
We recently updated Spark from 2.3 to 2.4; however, I am not sure whether it was working with 2.3 either. We also recently activated Kerberos.
I am not sure what this message is saying, but my guess is that some configuration sends requests to a specific server (the gateway), the target server is not SSL-encrypted, and there is a rule in place to avoid sending requests to non-SSL services. If that is the case, it is not important that the traffic be encrypted, as this is a development server.
Any theories would be most helpful so that I can investigate. The problem is, as far as I can tell from my research, no one else has reported this issue.
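For what it's worth, the guard producing this error is a small piece of pure-Python logic in pyspark/context.py, not anything SSL-related. Below is a minimal stand-alone reproduction of the check; the class names are simplified stand-ins for the py4j objects, not the real pyspark API:

```python
# Simplified mock of the Spark 2.4 guard in pyspark/context.py:
# SparkContext refuses any externally supplied Py4j gateway whose
# gateway_parameters carry no auth_token.
class GatewayParameters:
    def __init__(self, auth_token=None):
        self.auth_token = auth_token

class Gateway:
    def __init__(self, auth_token=None):
        self.gateway_parameters = GatewayParameters(auth_token)

def validate_gateway(gateway):
    # Mirrors the guard that raises in the Hue/Livy traceback above.
    if gateway is not None and gateway.gateway_parameters.auth_token is None:
        raise ValueError(
            "You are trying to pass an insecure Py4j gateway to Spark. This"
            " is not allowed as it is a security risk.")

try:
    validate_gateway(Gateway())                   # no token -> rejected
except ValueError as e:
    print("rejected:", e)

validate_gateway(Gateway(auth_token="s3cret"))    # token present -> accepted
```

So the rejection happens on the launcher side (Livy, in a Hue setup) before any network traffic is involved; SSL configuration on the target server is not what the message is about.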
One other thing to note, which may have no bearing on it at all: I cannot execute pyspark from the command line either, as I hit what appears to be a very old (2016) bug:
$ pyspark
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/shell.py", line 30, in <module>
    import pyspark
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/context.py", line 33, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 18, in <module>
  File "/opt/cloudera/parcels/Anaconda-5.1.0.1/lib/python3.6/pydoc.py", line 59, in <module>
    import inspect
  File "/opt/cloudera/parcels/Anaconda-5.1.0.1/lib/python3.6/inspect.py", line 361, in <module>
    Attribute = namedtuple('Attribute', 'name kind defining_class object')
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/serializers.py", line 381, in namedtuple
    cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
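As an aside, this namedtuple() traceback is a separate, older issue: Spark's serializers.py monkey-patches collections.namedtuple assuming the pre-3.6 positional signature, while in Python 3.6 the extra parameters became keyword-only, which is consistent with the 2016-era bug reports. The signature change can be seen directly (a small illustrative check, not part of Spark):

```python
import collections
import inspect

# Spark 1.6's serializers.py wraps collections.namedtuple and forwards the
# old signature namedtuple(typename, field_names, verbose, rename). In
# Python 3.6 the extra parameters became keyword-only, so the copied wrapper
# fails with "missing 3 required keyword-only arguments".
params = inspect.signature(collections.namedtuple).parameters
keyword_only = [name for name, p in params.items()
                if p.kind is inspect.Parameter.KEYWORD_ONLY]
# 'rename' and 'module' appear in this list on any Python >= 3.6.
print(keyword_only)
```

This is why the CDH 5.x Spark 1.6 shell on the command line breaks under Anaconda's Python 3.6, independently of the Py4j gateway error in Hue.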
Also, this article comes the closest to the possible issue we're having:
https://community.cloudera.com/t5/Web-UI-Hue-Beeswax/Issue-with-PySpark-and-Hue/m-p/52792#M2162
However, I don't know how to check whether Kerberos is set up, and set up properly, for this purpose. Any guidance on this is appreciated as well.
Any ideas/help would be much appreciated!
Thanks!
Created 07-22-2019 09:05 AM
As of Spark 2.4, the context.py code has been changed to require an authentication token. However, I'm not sure how to set this token; I have looked in Cloudera Manager, on the web, and in the files, and cannot find it anywhere.
Will someone from Cloudera please help us set up this requirement, since the code clearly requires it?
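For reference, when Spark 2.4's own launch scripts start the Python side, they advertise the gateway connection details through environment variables that pyspark/java_gateway.py reads back, which would explain why the token is not exposed anywhere in Cloudera Manager. A hedged sketch of that handshake follows; the variable names match what 2.4's java_gateway.py looks for, but treat the exact mechanism as an assumption when debugging a Livy/Hue setup:

```python
# Sketch: how the Python side discovers a pre-launched JVM gateway in
# Spark 2.4. The JVM launcher exports the port and a shared secret, and
# pyspark attaches using both. The helper below only models the lookup.
def read_gateway_conn_info(environ):
    """Return (port, secret) if a pre-launched JVM gateway is advertised,
    else None."""
    if "PYSPARK_GATEWAY_PORT" not in environ:
        return None
    return (int(environ["PYSPARK_GATEWAY_PORT"]),
            environ.get("PYSPARK_GATEWAY_SECRET"))

print(read_gateway_conn_info({"PYSPARK_GATEWAY_PORT": "25333",
                              "PYSPARK_GATEWAY_SECRET": "abc"}))
```

If the component that launches the Python process (Livy, for Hue sessions) predates this handshake, it creates the gateway without a token, and the ValueError above is the result.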
Thanks!
Created on 11-19-2019 05:31 AM - edited 11-19-2019 05:33 AM
Hello.
I think it's possible to fix this. I did it, but it needs deeper testing.
You need to find your pyspark installation; in your case:
/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/context.py
Then comment out the following lines (around line 115). Depending on the version (2.3.3, 2.4.0, 2.4.x) they can look a little different, as the code there has changed several times, but the idea is to comment out the raise ValueError part:
#if gateway is not None and gateway.gateway_parameters.auth_token is None:
# raise ValueError(
# "You are trying to pass an insecure Py4j gateway to Spark. This"
# " is not allowed as it is a security risk.")
In some other versions it may also be necessary to comment out something similar in the java_gateway.py file.
Regards,
Albert
Created 02-19-2020 07:34 PM
I just bumped into the same issue today. Is there no official word from Cloudera on this? Have they made the PYSPARK_ALLOW_INSECURE_GATEWAY environment variable completely unsupported? Is there any official workaround?
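For context, in the 2.4.x releases that had it, that escape hatch worked roughly like this inside the guard in context.py. This is a simplified sketch; the exact wording and which releases honour the variable are assumptions worth verifying against your parcel's source:

```python
import warnings

# Simplified model of the Spark 2.4.x guard with the escape hatch: setting
# PYSPARK_ALLOW_INSECURE_GATEWAY=1 downgrades the ValueError to a warning.
def check_gateway(auth_token, environ):
    if auth_token is None:
        if environ.get("PYSPARK_ALLOW_INSECURE_GATEWAY", "0") == "1":
            warnings.warn("insecure Py4j gateway allowed by env override")
            return True
        raise ValueError(
            "You are trying to pass an insecure Py4j gateway to Spark. This"
            " is not allowed as it is a security risk.")
    return True

check_gateway(None, {"PYSPARK_ALLOW_INSECURE_GATEWAY": "1"})  # warns, passes
check_gateway("s3cret", {})                                   # token, passes
```

Where the variable is ignored (or was removed), the guard raises unconditionally, which matches the behaviour everyone in this thread is seeing.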
Created 02-21-2020 07:52 AM
I actually did figure out how to solve this, at least in our case. Hopefully it's relevant to yours as well. See here.
