Created on 07-19-2019 08:51 AM - edited 09-16-2022 07:31 AM
Hi,
I have spent a few days researching why we cannot execute any Python code in the PySpark interface inside Hue.
PySpark command:
from pyspark import SparkContext
Error Message:
stdout:
stderr:
WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2) overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2/).
WARNING: Running spark-class from user-defined location.
19/07/19 07:59:45 WARN spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/07/19 07:59:46 WARN spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/07/19 07:59:46 WARN rsc.RSCConf: Your hostname, usbda04.unix.rgbk.com, resolves to a loopback address, but we couldn't find any external IP address!
19/07/19 07:59:46 WARN rsc.RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
19/07/19 07:59:49 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/07/19 07:59:49 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
YARN Diagnostics:
    sys.exit(main())
  File "/tmp/2588781570290623481", line 589, in main
    sc = SparkContext(jsc=jsc, gateway=gateway, conf=conf)
  File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 121, in __init__
ValueError: You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk.
YARN Diagnostics:
We recently updated Spark from 2.3 to 2.4. However, I am not sure whether it was working with 2.3. We also recently activated Kerberos.
I am not sure what this message means. My guess is that a configuration setting sends requests to a specific server (the gateway) that is not SSL-encrypted, and that a rule exists to block requests to non-SSL services. If that is the case, encrypting the traffic is not important to us, as this is a development server.
Any theories would be most helpful, as they would give me something to investigate. The problem is that, according to the research I have done, no one else has had this problem.
One other thing to note, which may have no bearing on it at all: I cannot run pyspark from the command line either, as I get what appears to be a very old (2016) bug:
$ pyspark
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/shell.py", line 30, in <module>
    import pyspark
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/context.py", line 33, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 18, in <module>
  File "/opt/cloudera/parcels/Anaconda-5.1.0.1/lib/python3.6/pydoc.py", line 59, in <module>
    import inspect
  File "/opt/cloudera/parcels/Anaconda-5.1.0.1/lib/python3.6/inspect.py", line 361, in <module>
    Attribute = namedtuple('Attribute', 'name kind defining_class object')
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/serializers.py", line 381, in namedtuple
    cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
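One plausible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: the pyspark under the CDH-5.14.2 parcel replaces collections.namedtuple with a wrapper around a copy of the original function, and that copy does not carry over keyword-only defaults, which the namedtuple in Python 3.6 relies on. The sketch below reproduces the mechanism with a stand-in function. The names make_record and copy_without_kwdefaults are hypothetical and are not pyspark code.

```python
import types


def make_record(name, *, verbose=False, rename=False, module=None):
    """Stand-in for a function whose optional arguments are keyword-only."""
    return (name, verbose, rename, module)


def copy_without_kwdefaults(func):
    """Copy a function but leave out __kwdefaults__ (the suspected mistake)."""
    return types.FunctionType(
        func.__code__,
        func.__globals__,
        func.__name__,
        func.__defaults__,
        func.__closure__,
    )


copied = copy_without_kwdefaults(make_record)

# The copy has lost its keyword-only defaults, so a plain call now fails.
try:
    copied("Attribute")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Restoring __kwdefaults__ on the copy makes the same call work again.
copied.__kwdefaults__ = make_record.__kwdefaults__
restored_result = copied("Attribute")
```

If this is indeed the cause, it would point to a mismatch between the Python 3.6 interpreter and the older Spark under the CDH parcel, separate from the Hue error above.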
Also, this article comes the closest to the possible issue we're having:
https://community.cloudera.com/t5/Web-UI-Hue-Beeswax/Issue-with-PySpark-and-Hue/m-p/52792#M2162
However, I don't know how to check whether Kerberos is set up, and set up properly, for this purpose. Any guidance on this is appreciated as well.
Any ideas/help would be much appreciated!
Thanks!
Created 07-22-2019 09:05 AM
Created on 11-19-2019 05:31 AM - edited 11-19-2019 05:33 AM
Hello.
I think it's possible to fix this. I did it, but I still need to test it thoroughly.
You need to find your pyspark installation; in your case:
/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/context.py
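A quick way to confirm which context.py Python would actually load is to ask the import system. This is a minimal sketch: locate_module is a hypothetical helper, and it only gives a meaningful answer when run with the same interpreter and PYTHONPATH as the failing session. Note that the traceback in the question shows context.py being loaded from pyspark.zip under the SPARK2 parcel, so the file that matters may live inside that archive.

```python
import importlib.util


def locate_module(name):
    """Return the file a module would be loaded from, or None if not found."""
    try:
        # For a dotted name, find_spec imports the parent package first.
        spec = importlib.util.find_spec(name)
    except ImportError:
        # The parent package (for example "pyspark") is not importable.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Prints the path of pyspark's context module, or None if pyspark is absent.
    print(locate_module("pyspark.context"))
```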
Then comment out the following lines (around line 115). Depending on the version (2.3.3, 2.4.0, 2.4.x), they may look a little different, as the code there has changed several times, but the idea is to comment out the raise ValueError part:
#if gateway is not None and gateway.gateway_parameters.auth_token is None:
#    raise ValueError(
#        "You are trying to pass an insecure Py4j gateway to Spark. This"
#        " is not allowed as it is a security risk.")
In some other versions it may also be necessary to comment out something similar in the java_gateway.py file.
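Based on the lines quoted above, the check appears to reject any Py4J gateway whose gateway_parameters.auth_token is None, which suggests the error is about a missing authentication token rather than SSL. The following is a minimal sketch of that logic using stand-in objects. check_gateway and the SimpleNamespace gateways are illustrative assumptions, not the actual pyspark implementation.

```python
from types import SimpleNamespace


def check_gateway(gateway):
    """Mirror of the quoted check: refuse a gateway that has no auth token."""
    if gateway is not None and gateway.gateway_parameters.auth_token is None:
        raise ValueError(
            "You are trying to pass an insecure Py4j gateway to Spark. This"
            " is not allowed as it is a security risk.")


# Stand-ins for real Py4J gateway objects (assumed shape, for illustration).
secure = SimpleNamespace(
    gateway_parameters=SimpleNamespace(auth_token="example-token"))
insecure = SimpleNamespace(
    gateway_parameters=SimpleNamespace(auth_token=None))

check_gateway(None)    # no gateway supplied: nothing to check
check_gateway(secure)  # token present: accepted

try:
    check_gateway(insecure)
    rejected = False
except ValueError:
    rejected = True
```

If this reading is right, commenting the check out only silences the error; the gateway passed in by the caller still has no token.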
Regards,
Albert
Created 02-19-2020 07:34 PM
I just bumped into the same issue today. Is there no official word from Cloudera on this? Have they made the PYSPARK_ALLOW_INSECURE_GATEWAY environment variable completely unsupported? Is there any official workaround?
Created 02-21-2020 07:52 AM
I actually did figure out how to solve this, at least in our case. Hopefully it's relevant to yours as well. See here.