Support Questions

Conda env && Spark Jobs

New Contributor
Hi,

Like many of you, I guess, I mix PySpark jobs with regular pandas DataFrames and scikit-learn for data science.

But I'm sharing the platform with many other data scientists, and we might end up with a mess of libraries.

I'd love to be able to have per-project conda envs that I could activate separately and in parallel for each job context.

Is that manageable?

Thanks.

Yann
4 REPLIES

Re: Conda env

Master Collaborator
Are you maybe looking for the Anaconda parcel for CDH?
http://know.continuum.io/anaconda-for-cloudera.html

That gives you a fixed distribution of Python that supports PySpark and probably covers your other use cases on the cluster. This is what virtualenvs can help with, in any event.

Re: Conda env

New Contributor
Hi,


Parcels are part of it. But then how do I create and activate virtual envs for specific jobs?

Re: Conda env

Explorer

If you installed the Anaconda parcel for CDH, then one way to use the new Python environment is this command:

PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/bin/python spark-submit <script.py>
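
In principle the same trick should work for a project-specific conda env created inside the parcel; myproject_env below is just a placeholder for an env you would create yourself (e.g. with conda create -n myproject_env), since conda puts named envs under the envs/ subdirectory of the installation:

PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/envs/myproject_env/bin/python spark-submit <script.py>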

 

Alternatively, you can enter the pyspark shell with this command:

PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/bin/python pyspark

 

Check that the parcel installation directory is correct for your cluster (on some installations it may be /var/opt/teradata/cloudera/parcels/Anaconda/bin/python).

 

If you want to change the default Python configuration for your whole cluster, then follow this guide:

http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh/
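
In short, that guide amounts to pointing PYSPARK_PYTHON at the parcel's interpreter on every node; a minimal sketch, assuming you manage spark-env.sh through your usual configuration mechanism (for example a Cloudera Manager safety valve), is a single line like:

export PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/bin/python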


Re: Conda env

Explorer

I am not sure whether what you want to achieve, using different virtual envs across the master and worker nodes, is possible yet. However, you could try to create the virtual envs at the same location on all nodes, for example with Ansible or Puppet. Afterwards, modify spark-env.sh, which is executed on every node when a Spark job runs: activate the desired virtual env there and set the environment variable PYSPARK_PYTHON to the location of the Python interpreter inside that env. A sketch follows below.
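
A minimal sketch of that spark-env.sh fragment, assuming a conda env named myproject_env created under the Anaconda parcel at the same path on every node (both the env name and the install prefix are placeholders for whatever you set up):

# spark-env.sh (distributed to every node, e.g. via Ansible or Puppet)
# Activate the shared conda env and point PySpark at its interpreter.
source /var/opt/cloudera/parcels/Anaconda/bin/activate myproject_env
export PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/envs/myproject_env/bin/python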

 

Another alternative could be to run Spark on YARN with Docker containers. This requires some research to get working, but the idea would be to have the Spark driver and executors run inside Docker containers that already contain the desired Python libraries; a rough sketch is shown below.
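
Assuming your YARN NodeManagers are set up for the Docker container runtime (a cluster-side prerequisite in itself), the submission would look roughly like the following, where my-pydata-image is a placeholder for an image built with your Python libraries:

spark-submit --master yarn \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=my-pydata-image \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=my-pydata-image \
  <script.py>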

 

Fingers crossed ;)