Created 04-21-2020 03:18 AM
Hi All,
We're currently running CDH 6.3.2 and trying to get Python 3.6.9 working on the Spark worker nodes, so far without success... As our cluster is built on RHEL 7, the default Python version is 2.7, which is EOL and doesn't have all the necessary libraries/modules. The original idea was to install Miniconda3 on all the nodes, create a py36 environment, and point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the proper Python executable. Apparently this doesn't work as expected: any PySpark job breaks with the error message:
Fatal Python error: Py_Initialize: Unable to get the locale encoding ModuleNotFoundError: No module named 'encodings'
This despite the fact that the "encodings" module is present in the py36 environment. If we add
spark.executorEnv.PYTHONPATH=/opt/miniconda3/envs/py36/lib/python3.6:/opt/miniconda3/envs/py36/lib/
to the spark-defaults.conf file in the project root, we get a different error message:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
zipimport.ZipImportError: can't decompress data; zlib not available
This is also misleading, because zlib is installed.
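For reference, the setup we are attempting looks roughly like this from the driver side -- a minimal sketch, assuming the Miniconda paths from our install exist on every node (adjust them for your cluster):

import os

# Point both driver and executors at the conda env's interpreter;
# this must be set before the SparkContext (and its JVM) is created.
# The paths below are from our Miniconda install -- an example, not a given.
os.environ["PYSPARK_PYTHON"] = "/opt/miniconda3/envs/py36/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/opt/miniconda3/envs/py36/bin/python"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py36-smoke-test").getOrCreate()

# A trivial job that reports the interpreter each executor actually runs:
vers = (spark.sparkContext
        .parallelize(range(2), 2)
        .map(lambda _: __import__("sys").executable)
        .distinct()
        .collect())
print(vers)  # expect ['/opt/miniconda3/envs/py36/bin/python']
spark.stop()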
We tried to install and activate the official Anaconda parcel for CDH, but it comes with Python 2.7, so at the end of the day the question is the same -- how do we tell a Spark worker node to use a specific Python version or virtual environment from a Jupyter notebook started on CDSW?
We've already found some guides on the net explaining how to tell Spark to use conda via the set of spark.pyspark.virtualenv.* properties in spark-defaults.conf, but they don't seem to affect anything; see the snippet below.
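For completeness, the kind of settings those guides describe looks roughly like this -- property names and values as we recall them from the guides, not verified by us, since they had no effect in our case:

spark.pyspark.virtualenv.enabled=true
spark.pyspark.virtualenv.type=conda
spark.pyspark.virtualenv.bin.path=/opt/miniconda3/bin/conda
spark.pyspark.virtualenv.requirements=/path/to/requirements.txt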
This particular Python version is the officially recommended one for CDSW, so it should be possible to use it on the Spark worker nodes as well...
Thanks in advance,
Kirill
Created 04-23-2020 06:03 AM
Good news!
Long story short -- the problem is solved, and it was not a CDSW/Jupyter-specific one 😉
A slightly more verbose explanation. To me this clearly looks like a Cloudera Manager bug, which should be addressed accordingly.
During the investigation, one of my colleagues suggested checking whether the command-line pyspark shell worked correctly, and apparently it didn't. Checked from an edge node, pyspark was able to start, but threw an error message:
[user.name@hostname.domain ~]$ pyspark
File "/opt/miniconda3/envs/py36/lib/python3.6/site.py", line 177
file=sys.stderr)
^
SyntaxError: invalid syntax
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)
This told me that Spark picked up the proper Python 3.6.9 executable from the Miniconda installation as expected, but was unable to process its own libraries/modules, which was quite weird... In hindsight the message is telling: print(..., file=sys.stderr) is Python 3 syntax, so a SyntaxError on that line means a Python 2 interpreter is parsing Python 3's site.py.
Googling for the error suggested checking whether Python 2.7 and 3.6 environments were being mixed -- in other words, despite reporting that PySpark started under Python 3.6.9, the error message actually came from Python 2.7.5 (the OS-"native" one). So I started looking at the spark-conf/spark-env.sh files and found that the bottom part of these files was corrupt and looked like this:
export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3
Yes, exactly like this: three lines instead of four, with one line concatenated from two identical blocks, missing the newline between them.
So I went to Cloudera Manager -> Spark -> Configuration and found that both "extra fields" -- Spark Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh and Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh -- contained the same two lines:
export PYSPARK_DRIVER_PYTHON=/bin/python3
export PYSPARK_PYTHON=/bin/python3
On the worker nodes, /bin/python3 points to the proper Python 3.6.9 executable in the Miniconda3 installation, so this would work fine if Cloudera Manager were able to glue the two sections together with a newline in between. But it looks like both "freetext" fields are simply taken as-is, stuck together, and appended to the bottom of the spark-env.sh file template. In our particular case this gave pyspark a strange mixture of the system's default Python 2.7 and conda's 3.6.9.
So, at the end of the day, we wiped out those "extras" in the Cloudera Manager-controlled Spark configuration, and after a restart everything ran smoothly, as expected.
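A quick way to double-check the result from the pyspark shell -- a minimal sketch, assuming the shell's predefined sc variable:

import sys

# Driver-side interpreter
print("driver:   ", sys.version)

# Each executor task reports its own interpreter version
execs = (sc.parallelize(range(2), 2)
         .map(lambda _: __import__("sys").version)
         .distinct()
         .collect())
print("executors:", execs)  # both should report 3.6.9 now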
Created 04-23-2020 10:33 AM
Hi @kirill_peskov ,
Thanks for reaching out to the Cloudera community and sharing your solution.
I was able to reproduce the issue in-house, and I will follow up with internal CM engineering.
Thanks!
Li
Li Wang, Technical Solution Manager