Created 04-21-2020 03:18 AM
Hi All,
We're currently running CDH 6.3.2 and trying to get Python 3.6.9 working on the Spark worker nodes, so far without success... As our cluster is built on RHEL 7, the default Python version is 2.7, which is EOL and doesn't have all the necessary libraries/modules. The original idea was to install Miniconda3 on all the nodes, create a py36 environment, and point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the proper Python executable. Apparently this doesn't work as expected: any PySpark job breaks with the error message:
Fatal Python error: Py_Initialize: Unable to get the locale encoding ModuleNotFoundError: No module named 'encodings'
This despite the fact that the "encodings" module is present in the py36 environment. If we add
spark.executorEnv.PYTHONPATH=/opt/miniconda3/envs/py36/lib/python3.6:/opt/miniconda3/envs/py36/lib/
to the spark-defaults.conf file in the project root, we get a different error message:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
zipimport.ZipImportError: can't decompress data; zlib not available
This is also misleading, because zlib is installed.
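For reference, the setup we are attempting looks roughly like this from the driver side -- a minimal sketch, assuming the Miniconda paths from our install exist on every node (adjust them for your cluster):

import os

# Point both driver and executors at the conda env's interpreter;
# this must be set before the SparkContext (and its JVM) is created.
# The paths below are from our Miniconda install -- an example, not a given.
os.environ["PYSPARK_PYTHON"] = "/opt/miniconda3/envs/py36/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/opt/miniconda3/envs/py36/bin/python"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py36-smoke-test").getOrCreate()

# A trivial job that reports the interpreter each executor actually runs:
vers = (spark.sparkContext
        .parallelize(range(2), 2)
        .map(lambda _: __import__("sys").executable)
        .distinct()
        .collect())
print(vers)  # expect ['/opt/miniconda3/envs/py36/bin/python']
spark.stop()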
We tried to install and activate the official Anaconda parcel for CDH, but it comes with Python 2.7, so at the end of the day the question is the same -- how do we tell a Spark worker node to use a specific Python version or virtual environment from a Jupyter notebook started on CDSW?
We've already found some guides on the net explaining how to tell Spark to use conda via the set of spark.pyspark.virtualenv.* properties in spark-defaults.conf, but they don't seem to affect anything; see the snippet below.
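For completeness, the kind of settings those guides describe looks roughly like this -- property names and values as we recall them from the guides, not verified by us, since they had no effect in our case:

spark.pyspark.virtualenv.enabled=true
spark.pyspark.virtualenv.type=conda
spark.pyspark.virtualenv.bin.path=/opt/miniconda3/bin/conda
spark.pyspark.virtualenv.requirements=/path/to/requirements.txt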
This particular Python version is the officially recommended one for CDSW, so it should be possible to use it on the Spark worker nodes as well...
Thanks in advance,
Kirill
Created 04-23-2020 06:03 AM
Good news!
Long story short -- the problem is solved, and it was not a CDSW/Jupyter-specific one 😉
A slightly more verbose explanation. To me this clearly looks like a Cloudera Manager bug, which should be addressed accordingly.
During the investigation, one of my colleagues suggested checking whether the command-line pyspark shell worked correctly, and apparently it didn't. Checked from an edge node, pyspark was able to start, but threw an error message:
[user.name@hostname.domain ~]$ pyspark
File "/opt/miniconda3/envs/py36/lib/python3.6/site.py", line 177
file=sys.stderr)
^
SyntaxError: invalid syntax
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)
This told me that Spark picked up the proper Python 3.6.9 executable from the Miniconda installation as expected, but was unable to process its own libraries/modules, which was quite weird... In hindsight the message is telling: print(..., file=sys.stderr) is Python 3 syntax, so a SyntaxError on that line means a Python 2 interpreter is parsing Python 3's site.py.
Googling for the error suggested checking whether Python 2.7 and 3.6 environments were being mixed -- in other words, despite reporting that PySpark started under Python 3.6.9, the error message actually came from Python 2.7.5 (the OS-"native" one). So I started looking at the spark-conf/spark-env.sh files and found that the bottom part of these files was corrupt and looked like this:
export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3
Yes, exactly like this: three lines instead of four, with one line concatenated from two identical blocks, missing the newline between them.
So I went to Cloudera Manager -> Spark -> Configuration and found that both "extra fields" -- Spark Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh and Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh -- contained the same two lines:
export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3
On the worker nodes, /bin/python3 points to the proper Python 3.6.9 executable in the Miniconda3 installation, so this would work fine if Cloudera Manager were able to glue the two sections together with a newline in between. But it looks like both "freetext" fields are simply taken as-is, stuck together, and appended to the bottom of the spark-env.sh file template. In our particular case this gave pyspark a strange mixture of the system's default Python 2.7 and conda's 3.6.9.
So, at the end of the day, we wiped out those "extras" in the Cloudera Manager-controlled Spark configuration, and after a restart everything ran smoothly, as expected.
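A quick way to double-check the result from the pyspark shell -- a minimal sketch, assuming the shell's predefined sc variable:

import sys

# Driver-side interpreter
print("driver:   ", sys.version)

# Each executor task reports its own interpreter version
execs = (sc.parallelize(range(2), 2)
         .map(lambda _: __import__("sys").version)
         .distinct()
         .collect())
print("executors:", execs)  # both should report 3.6.9 now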
Created 04-23-2020 10:33 AM
Hi @kirill_peskov ,
Thanks for reaching out to the Cloudera community and sharing your solution.
I was able to reproduce the issue in-house, and I will follow up with internal CM engineering.
Thanks!
Li
Li Wang, Technical Solution Manager