Since Anaconda provides parcels for CDH, I would like to know whether I should install the parcel when using CDSW, and what the pros and cons of using the Anaconda parcel are.
I found that the latest Anaconda parcel from the repo https://repo.continuum.io/pkgs/misc/parcels/ no longer supports Python 2: when I try to run a Python 2 Docker container, it complains that python2.7 cannot be found.
Thanks a lot!
No, CDSW does not need the Anaconda parcel to run. However, having the Anaconda parcel deployed on the CDH nodes makes it convenient to manage Python 2.x environments cluster-wide, as opposed to managing them manually on each node.
Do note that the Anaconda parcel is a CDH-compatible, relocatable version of the open source Anaconda platform that lets you get started with an easy installation of the Anaconda distribution on your CDH cluster. It is different from the commercial Anaconda subscriptions. For example, the Anaconda parcels for CDH are a good fit if you rely on Python 2, as there is no publicly available Anaconda CDH parcel for Python 3.6.
If you don't want to use the Anaconda parcel, you can manually install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON environment variable in your project. The Cloudera Data Science Workbench engine (Base Image Version 1) includes Python 2.7.11 and Python 3.6.1. For Python 3 sessions you can set PYSPARK3_PYTHON. Python 2 sessions continue to use the default PYSPARK_PYTHON variable. This allows you to run Python 2 and Python 3 sessions in parallel without either variable being overridden by the other.
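To check what a session will pick up, here is a minimal sketch you could run inside a CDSW session. The helper name resolve_pyspark_python is made up for illustration, and the selection rule (PYSPARK3_PYTHON for Python 3, PYSPARK_PYTHON otherwise, no fallback between them) is a simplification of the behaviour described above, not CDSW's actual implementation.

```python
import os
import sys


def resolve_pyspark_python(major_version, environ=None):
    """Return the interpreter path a session of the given Python major
    version is expected to use, or None if the variable is not set.

    Hypothetical helper: it only mirrors the rule described in the answer
    (Python 3 -> PYSPARK3_PYTHON, Python 2 -> PYSPARK_PYTHON).
    """
    env = os.environ if environ is None else environ
    if major_version == 3:
        # Python 3 sessions read their own variable.
        return env.get("PYSPARK3_PYTHON")
    # Python 2 sessions keep using the default variable.
    return env.get("PYSPARK_PYTHON")


if __name__ == "__main__":
    path = resolve_pyspark_python(sys.version_info[0])
    print("Session Python version:", sys.version_info[0])
    print("Configured executor Python:", path or "<not set>")
```

If the printed path is not set, or points to an interpreter that does not exist on the cluster nodes, that would be consistent with the "python2.7 cannot be found" error from the question.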