PySpark in Hue - pass an insecure Py4j gateway to Spark

Contributor

Hi,

 

I have spent a few days researching why we cannot execute any Python code in the PySpark interface inside Hue.

 

PySpark command:

 

from pyspark import SparkContext

 

Error Message:

 

stdout:
stderr:
WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2) overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2/).
WARNING: Running spark-class from user-defined location.
19/07/19 07:59:45 WARN spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/07/19 07:59:46 WARN spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/07/19 07:59:46 WARN rsc.RSCConf: Your hostname, usbda04.unix.rgbk.com, resolves to a loopback address, but we couldn't find any external IP address!
19/07/19 07:59:46 WARN rsc.RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
19/07/19 07:59:49 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/07/19 07:59:49 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
YARN Diagnostics:
sys.exit(main())
File "/tmp/2588781570290623481", line 589, in main
sc = SparkContext(jsc=jsc, gateway=gateway, conf=conf)
File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 121, in __init__
ValueError: You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk.
YARN Diagnostics:

We recently updated Spark from 2.3 to 2.4. However, I am not sure whether this was working with 2.3. We also recently activated Kerberos.

I am not sure what this message is saying. My guess is that some configuration is set up to send requests to a specific server (the gateway), that this server is not SSL-encrypted, and that a rule is in place to block requests to non-SSL services. Is that right? If so, it is not important that the traffic be encrypted, as this is a development server.

Any theories would be most helpful, as they would give me something to investigate. As far as my research shows, no one else has reported this problem.

 

One other thing to note, which may have no bearing on this at all: I also cannot run pyspark from the command line, as I get what appears to be a very old (2016) bug:

$ pyspark
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/shell.py", line 30, in <module>
    import pyspark
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/context.py", line 33, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 18, in <module>
  File "/opt/cloudera/parcels/Anaconda-5.1.0.1/lib/python3.6/pydoc.py", line 59, in <module>
    import inspect
  File "/opt/cloudera/parcels/Anaconda-5.1.0.1/lib/python3.6/inspect.py", line 361, in <module>
    Attribute = namedtuple('Attribute', 'name kind defining_class object')
  File "/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/serializers.py", line 381, in namedtuple
    cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
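
Editor's note: the traceback above runs entirely through the CDH-5.14.2 parcel's `lib/spark` (with `py4j-0.9`), not the SPARK2 parcel, so it looks like a separate problem from the Hue error. The `TypeError` is consistent with a function being copied through `types.FunctionType` without its keyword-only defaults, which would break under the Python 3.6 `namedtuple` signature. The sketch below reproduces that mechanism in plain Python. `namedtuple_like` and `copy_func` are hypothetical stand-ins written for illustration, not the actual pyspark code.

```python
import types


def namedtuple_like(typename, field_names, *, verbose=False, rename=False, module=None):
    """Hypothetical stand-in with the keyword-only arguments named in the error."""
    return typename, field_names, verbose, rename, module


def copy_func(f, keep_kwdefaults):
    """Copy a function object; optionally carry over its keyword-only defaults."""
    fn = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                            f.__defaults__, f.__closure__)
    if keep_kwdefaults:
        # Without this line the copy has no defaults for keyword-only arguments.
        fn.__kwdefaults__ = f.__kwdefaults__
    return fn


broken = copy_func(namedtuple_like, keep_kwdefaults=False)
try:
    broken("Attribute", "name kind defining_class object")
    error_message = None
except TypeError as exc:
    # The lost defaults make verbose/rename/module look like required arguments.
    error_message = str(exc)

fixed = copy_func(namedtuple_like, keep_kwdefaults=True)
result = fixed("Attribute", "name kind defining_class object")
```

If that is what is happening here, the practical options would be to launch the Spark 2 shell from the SPARK2 parcel rather than the CDH-bundled `pyspark`, or to point the older Spark at a Python version earlier than 3.6. Both are assumptions to verify on the cluster.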

 

Also, this thread comes closest to the issue we may be having:

https://community.cloudera.com/t5/Web-UI-Hue-Beeswax/Issue-with-PySpark-and-Hue/m-p/52792#M2162

However, I don't know how to check whether Kerberos is set up, and set up properly, for this purpose. Any guidance on this is appreciated as well.
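
Editor's note: one small first check is whether the OS user running the job holds a usable Kerberos ticket at all. The sketch below assumes the MIT Kerberos `klist` command is on the PATH, where `-s` runs silently and exits with status 0 only if the credentials cache is readable and unexpired. It says nothing about whether Hue or Livy are configured correctly for Kerberos; the function name `has_valid_kerberos_ticket` is made up for this example.

```python
import subprocess


def has_valid_kerberos_ticket():
    """Return True/False for a usable ticket cache, or None if klist is missing."""
    try:
        # Assumption: MIT klist, where -s is silent and exit status 0 means
        # the credentials cache can be read and has not expired.
        completed = subprocess.run(
            ["klist", "-s"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # klist is not installed on this host, so nothing can be concluded.
        return None
    return completed.returncode == 0


status = has_valid_kerberos_ticket()
```

A result of `False` would suggest running `kinit` for that user first; `True` only means a ticket exists, not that the services accept it.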

 

Any ideas/help would be much appreciated!

 

Thanks!

 

1 REPLY

Re: PySpark in Hue - pass an insecure Py4j gateway to Spark

Contributor
UPDATE:
As of Spark 2.4, the context.py code has been changed to require an authentication token. However, I'm not sure how to set this token: I have looked in Cloudera Manager, on the web, and in the files, and cannot find it anywhere.
Will someone from Cloudera please help us set up this requirement, as the code clearly requires it?
Thanks!
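
Editor's note: the error in the original post concerns the Py4j gateway's authentication token rather than SSL. The sketch below is a simplified stand-in for the kind of check the update describes, not the actual Spark source. It assumes the gateway object exposes `gateway_parameters.auth_token`, and `fake_gateway` is a hypothetical helper that only mimics that attribute.

```python
from types import SimpleNamespace


def check_gateway(gateway):
    """Simplified stand-in for the check described above, not Spark's own code."""
    if gateway is not None and gateway.gateway_parameters.auth_token is None:
        raise ValueError(
            "You are trying to pass an insecure Py4j gateway to Spark. "
            "This is not allowed as it is a security risk."
        )
    return True


def fake_gateway(auth_token):
    """Hypothetical object carrying only the attribute the check reads."""
    return SimpleNamespace(gateway_parameters=SimpleNamespace(auth_token=auth_token))


try:
    check_gateway(fake_gateway(None))
    rejected = False
except ValueError:
    # A gateway created without a token is refused.
    rejected = True

# A gateway created with a token passes the check.
accepted = check_gateway(fake_gateway("some-secret"))
```

In the traceback, the gateway is passed in by the generated script under `/tmp` (`SparkContext(jsc=jsc, gateway=gateway, conf=conf)`), and the `rsc.RSCConf` warnings point to Livy. So the token would presumably have to be supplied by whatever creates that gateway, rather than through a Spark setting in Cloudera Manager. That is an inference from the log, not something confirmed in this thread.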