PySpark from CDSW/Jupyter, unable to get working combination with Python 3.6.9


Hi All,

We're currently running CDH 6.3.2 and are trying to get Python 3.6.9 working on the Spark worker nodes, so far without success. Our cluster is built on RHEL 7, whose default Python is 2.7; that version is EOL and doesn't have all the necessary libraries/modules. The original idea was to install Miniconda3 on all the nodes, create a py36 environment, and point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at that environment's python executable. Apparently this doesn't work as expected: any PySpark job breaks with the following error message:

 

Fatal Python error: Py_Initialize: Unable to get the locale encoding
  ModuleNotFoundError: No module named 'encodings'
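
For reference, this is roughly how we point Spark at the environment (just a sketch; we're assuming the two variables are simply exported in the session/worker environment, and the paths follow our Miniconda layout under /opt/miniconda3):

export PYSPARK_PYTHON=/opt/miniconda3/envs/py36/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/miniconda3/envs/py36/bin/python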

This happens despite the fact that the "encodings" module is definitely present in the py36 environment. If we add

spark.executorEnv.PYTHONPATH=/opt/miniconda3/envs/py36/lib/python3.6:/opt/miniconda3/envs/py36/lib/

to the spark-defaults.conf file in the project root, we get a different error message:

Traceback (most recent call last):
    File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 183, in _run_module_as_main
      mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
    File "/opt/miniconda3/envs/py36/lib/python3.6/runpy.py", line 109, in _get_module_details
      __import__(pkg_name)
  zipimport.ZipImportError: can't decompress data; zlib not available

This is also incorrect, because zlib is installed in that environment.
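
For what it's worth, invoking the environment's interpreter directly on a node imports both modules without complaint, along the lines of:

/opt/miniconda3/envs/py36/bin/python -c "import encodings, zlib; print(zlib.ZLIB_VERSION)"

So the modules only go missing when the executor launches Python, not when we launch it ourselves.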

We tried installing and activating the official Anaconda parcel for CDH, but it ships with Python 2.7, so at the end of the day the question is the same: how do we tell the Spark worker nodes to use a specific Python version or virtual environment from a Jupyter notebook started on CDSW?
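
To make the goal concrete, this is the kind of thing we're hoping to do from the notebook (just a sketch; the spark.executorEnv.* / spark.yarn.appMasterEnv.* properties are our guess at the relevant knobs, and the paths match our Miniconda layout):

import os
from pyspark.sql import SparkSession

PY36 = "/opt/miniconda3/envs/py36/bin/python"

# driver side: make the CDSW/Jupyter session itself run on py36
os.environ["PYSPARK_PYTHON"] = PY36
os.environ["PYSPARK_DRIVER_PYTHON"] = PY36

# executor/AM side: push the same interpreter into the YARN containers
spark = (
    SparkSession.builder
    .config("spark.executorEnv.PYSPARK_PYTHON", PY36)
    .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", PY36)
    .getOrCreate()
)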

We've already found some guides on the net explaining how to tell Spark to use conda via a set of spark.pyspark.virtualenv.* properties in spark-defaults.conf, but they don't seem to affect anything.
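
The properties from those guides look roughly like this (the values are our own guesses for our layout; none of them seem to change anything):

spark.pyspark.virtualenv.enabled=true
spark.pyspark.virtualenv.type=conda
spark.pyspark.virtualenv.bin.path=/opt/miniconda3/bin/conda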

This particular Python version is the officially recommended one for CDSW, so there should be a way to use it on the Spark worker nodes as well...

Thanks in advance,

Kirill
