Created 04-21-2020 03:18 AM
Hi All,
We're currently running CDH 6.3.2 and trying to get Python 3.6.9 working on the Spark worker nodes, so far without success. As our cluster is built on RHEL7, the default Python version is 2.7, which is EOL and doesn't have all the necessary libraries/modules. The original idea was to install Miniconda3 on all the nodes, create a py36 environment, and point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at that environment's Python executable, i.e. roughly the following (shown here as plain exports; paths are per our Miniconda3 install under /opt/miniconda3):
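# sketch of what was configured; the env's bin path follows the standard conda layout
export PYSPARK_PYTHON=/opt/miniconda3/envs/py36/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/miniconda3/envs/py36/bin/python

Apparently this doesn't work as expected; every PySpark job breaks with the error message: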
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ModuleNotFoundError: No module named 'encodings'
This is despite the fact that the "encodings" module is definitely there in the py36 environment. If we add
spark.executorEnv.PYTHONPATH=/opt/miniconda3/envs/py36/lib/python3.6:/opt/miniconda3/envs/py36/lib/
to the spark-defaults.conf file in the project root, we get a different error message:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
zipimport.ZipImportError: can't decompress data; zlib not available
This is also not correct, because zlib is definitely installed.
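For instance, running the py36 interpreter directly on a node with a quick check along these lines (just an illustration) shows zlib importing without a problem:

/opt/miniconda3/envs/py36/bin/python -c "import zlib; print(zlib.ZLIB_VERSION)"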
We also tried installing and activating the official Anaconda parcel for CDH, but it comes with Python 2.7, so at the end of the day the question is the same: how do we tell a Spark worker node to use a specific Python version or virtual environment from a Jupyter notebook started on CDSW?
We've already found some guides on the net explaining how to tell Spark to use conda via the set of spark.pyspark.virtualenv.* properties in spark-defaults.conf, but they don't seem to have any effect.
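For reference, what those guides suggest is roughly the following in spark-defaults.conf (property names and values are taken from the guides themselves, not from any official CDH documentation, so whether the Spark build shipped with CDH 6.3.2 honors them at all is part of the question):

# illustrative values only; the conda binary path assumes Miniconda3 under /opt/miniconda3
spark.pyspark.virtualenv.enabled=true
spark.pyspark.virtualenv.type=conda
spark.pyspark.virtualenv.bin.path=/opt/miniconda3/bin/conda
spark.pyspark.virtualenv.python_version=3.6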
This particular Python version is the officially recommended one for CDSW, so it should be possible to use it on the Spark worker nodes as well...
Thanx in advance,
Kirill