PySpark in Hue - pass an insecure Py4j gateway to Spark
Labels: Apache Spark
Created on 07-19-2019 08:51 AM - edited 09-16-2022 07:31 AM
Hi,
I have spent a few days researching why we cannot execute any Python code in the PySpark interface inside Hue.
PySpark command:
from pyspark import SparkContext
Error Message:
stdout: stderr:
WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2) overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2/).
WARNING: Running spark-class from user-defined location.
19/07/19 07:59:45 WARN spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/07/19 07:59:46 WARN spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/07/19 07:59:46 WARN rsc.RSCConf: Your hostname, usbda04.unix.rgbk.com, resolves to a loopback address, but we couldn't find any external IP address!
19/07/19 07:59:46 WARN rsc.RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
19/07/19 07:59:49 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/07/19 07:59:49 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
YARN Diagnostics:
    sys.exit(main())
  File "/tmp/2588781570290623481", line 589, in main
    sc = SparkContext(jsc=jsc, gateway=gateway, conf=conf)
  File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 121, in __init__
ValueError: You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk.
YARN Diagnostics:
We recently updated Spark from 2.3 to 2.4; however, I am not sure whether it was working with 2.3 either. We also recently activated Kerberos.
I am not sure what this message is saying, but my guess is that some configuration sends requests to a specific server (the gateway), the target server is not SSL-encrypted, and there is a rule in place to avoid sending requests to non-SSL services. If that is the case, it is not important that the traffic be encrypted, as this is a development server.
Any theories would be most helpful so that I can investigate. The problem is, as far as I can tell from my research, no one else has reported this issue.
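For what it's worth, the guard producing this error is a small piece of pure-Python logic in pyspark/context.py, not anything SSL-related. Below is a minimal stand-alone reproduction of the check; the class names are simplified stand-ins for the py4j objects, not the real pyspark API:

```python
# Simplified mock of the Spark 2.4 guard in pyspark/context.py:
# SparkContext refuses any externally supplied Py4j gateway whose
# gateway_parameters carry no auth_token.
class GatewayParameters:
    def __init__(self, auth_token=None):
        self.auth_token = auth_token

class Gateway:
    def __init__(self, auth_token=None):
        self.gateway_parameters = GatewayParameters(auth_token)

def validate_gateway(gateway):
    # Mirrors the guard that raises in the Hue/Livy traceback above.
    if gateway is not None and gateway.gateway_parameters.auth_token is None:
        raise ValueError(
            "You are trying to pass an insecure Py4j gateway to Spark. This"
            " is not allowed as it is a security risk.")

try:
    validate_gateway(Gateway())                   # no token -> rejected
except ValueError as e:
    print("rejected:", e)

validate_gateway(Gateway(auth_token="s3cret"))    # token present -> accepted
```

So the rejection happens on the launcher side (Livy, in a Hue setup) before any network traffic is involved; SSL configuration on the target server is not what the message is about.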
One other thing to note, which may have no bearing on it at all: I cannot execute pyspark from the command line either, as I hit what appears to be a very old (2016) bug:
$ pyspark
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/shell.py", line 30, in <module>
    import pyspark
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/context.py", line 33, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 18, in <module>
  File "/opt/cloudera/parcels/Anaconda-5.1.0.1/lib/python3.6/pydoc.py", line 59, in <module>
    import inspect
  File "/opt/cloudera/parcels/Anaconda-5.1.0.1/lib/python3.6/inspect.py", line 361, in <module>
    Attribute = namedtuple('Attribute', 'name kind defining_class object')
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/serializers.py", line 381, in namedtuple
    cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
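As an aside, this namedtuple() traceback is a separate, older issue: Spark's serializers.py monkey-patches collections.namedtuple assuming the pre-3.6 positional signature, while in Python 3.6 the extra parameters became keyword-only, which is consistent with the 2016-era bug reports. The signature change can be seen directly (a small illustrative check, not part of Spark):

```python
import collections
import inspect

# Spark 1.6's serializers.py wraps collections.namedtuple and forwards the
# old signature namedtuple(typename, field_names, verbose, rename). In
# Python 3.6 the extra parameters became keyword-only, so the copied wrapper
# fails with "missing 3 required keyword-only arguments".
params = inspect.signature(collections.namedtuple).parameters
keyword_only = [name for name, p in params.items()
                if p.kind is inspect.Parameter.KEYWORD_ONLY]
# 'rename' and 'module' appear in this list on any Python >= 3.6.
print(keyword_only)
```

This is why the CDH 5.x Spark 1.6 shell on the command line breaks under Anaconda's Python 3.6, independently of the Py4j gateway error in Hue.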
Also, this article comes the closest to the possible issue we're having:
https://community.cloudera.com/t5/Web-UI-Hue-Beeswax/Issue-with-PySpark-and-Hue/m-p/52792#M2162
However, I don't know how to check whether Kerberos is set up, and set up properly, for this purpose. Any guidance on this is appreciated as well.
Any ideas/help would be much appreciated!
Thanks!
Created 07-22-2019 09:05 AM
As of Spark 2.4, the context.py code has been changed to require an authentication token. However, I'm not sure how to set this token; I have looked in Cloudera Manager, on the web, and in the files, and cannot find it anywhere.
Will someone from Cloudera please help us set up this requirement, since the code clearly requires it?
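For reference, when Spark 2.4's own launch scripts start the Python side, they advertise the gateway connection details through environment variables that pyspark/java_gateway.py reads back, which would explain why the token is not exposed anywhere in Cloudera Manager. A hedged sketch of that handshake follows; the variable names match what 2.4's java_gateway.py looks for, but treat the exact mechanism as an assumption when debugging a Livy/Hue setup:

```python
# Sketch: how the Python side discovers a pre-launched JVM gateway in
# Spark 2.4. The JVM launcher exports the port and a shared secret, and
# pyspark attaches using both. The helper below only models the lookup.
def read_gateway_conn_info(environ):
    """Return (port, secret) if a pre-launched JVM gateway is advertised,
    else None."""
    if "PYSPARK_GATEWAY_PORT" not in environ:
        return None
    return (int(environ["PYSPARK_GATEWAY_PORT"]),
            environ.get("PYSPARK_GATEWAY_SECRET"))

print(read_gateway_conn_info({"PYSPARK_GATEWAY_PORT": "25333",
                              "PYSPARK_GATEWAY_SECRET": "abc"}))
```

If the component that launches the Python process (Livy, for Hue sessions) predates this handshake, it creates the gateway without a token, and the ValueError above is the result.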
Thanks!
Created on 11-19-2019 05:31 AM - edited 11-19-2019 05:33 AM
Hello.
I think it's possible to fix this. I did it, but it needs deeper testing.
You need to find your pyspark installation; in your case:
/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/context.py
Then comment out the following lines (around line 115). Depending on the version (2.3.3, 2.4.0, 2.4.x) they can look a little different, as the code there has changed several times, but the idea is to comment out the raise ValueError part:
#if gateway is not None and gateway.gateway_parameters.auth_token is None:
# raise ValueError(
# "You are trying to pass an insecure Py4j gateway to Spark. This"
# " is not allowed as it is a security risk.")
In some other versions it may also be necessary to comment out something similar in the java_gateway.py file.
Regards,
Albert
Created 02-19-2020 07:34 PM
I just bumped into the same issue today. Is there no official word from Cloudera on this? Have they made the PYSPARK_ALLOW_INSECURE_GATEWAY environment variable completely unsupported? Is there any official workaround?
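For context, in the 2.4.x releases that had it, that escape hatch worked roughly like this inside the guard in context.py. This is a simplified sketch; the exact wording and which releases honour the variable are assumptions worth verifying against your parcel's source:

```python
import warnings

# Simplified model of the Spark 2.4.x guard with the escape hatch: setting
# PYSPARK_ALLOW_INSECURE_GATEWAY=1 downgrades the ValueError to a warning.
def check_gateway(auth_token, environ):
    if auth_token is None:
        if environ.get("PYSPARK_ALLOW_INSECURE_GATEWAY", "0") == "1":
            warnings.warn("insecure Py4j gateway allowed by env override")
            return True
        raise ValueError(
            "You are trying to pass an insecure Py4j gateway to Spark. This"
            " is not allowed as it is a security risk.")
    return True

check_gateway(None, {"PYSPARK_ALLOW_INSECURE_GATEWAY": "1"})  # warns, passes
check_gateway("s3cret", {})                                   # token, passes
```

Where the variable is ignored (or was removed), the guard raises unconditionally, which matches the behaviour everyone in this thread is seeing.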
Created 02-21-2020 07:52 AM
I actually did figure out how to solve this, at least in our case. Hopefully it's relevant to yours as well. See here.
