New Contributor
Posts: 4
Registered: 02-19-2016

Conda env && Spark Jobs

Hi,

Like many of you, I guess, I mix PySpark jobs with regular pandas DataFrames and scikit-learn for data science.

But I'm sharing the platform with many other data scientists, and we might end up with a mess of libraries.

I'd love to be able to have per-project conda envs that I could activate separately, and in parallel, for each job context.

Manageable?

Thanks.

Yann
Cloudera Employee
Posts: 366
Registered: 07-29-2013

Re: Conda env

Are you maybe looking for the Anaconda parcel for CDH?
http://know.continuum.io/anaconda-for-cloudera.html

That gives you a fixed distribution of Python that supports PySpark, and probably your other use cases on the cluster.
Virtualenvs can also help with this, in any event.
New Contributor
Posts: 4
Registered: 02-19-2016

Re: Conda env

Hi,


Parcels are part of it. But then how do I create and activate virtual envs
for specific jobs?
Explorer
Posts: 13
Registered: 04-04-2016

Re: Conda env

If you installed the Anaconda parcel for CDH, then one way to use the new Python environment is this command:

PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/bin/python spark-submit <script.py>

 

Alternatively, you can enter the pyspark shell with this command:
PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/bin/python pyspark
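
To get per-job environments rather than the base parcel Python, the same trick should also work with the interpreter of a specific conda env, assuming that env exists at the same path on every node (the /opt/envs/projectA path below is only a made-up example):

PYSPARK_PYTHON=/opt/envs/projectA/bin/python spark-submit <script.py>

In yarn-cluster mode you would probably also want to pass the same path to the application master, e.g. with --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/envs/projectA/bin/python.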

 

Check that the parcel installation directory is correct for you (/var/opt/teradata/cloudera/parcels/Anaconda/bin/python).

 

If you want to change the default Python configuration for your whole cluster, then follow this guide:

http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh/
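
If I remember correctly, the cluster-wide change in that guide essentially boils down to exporting the variable in spark-env.sh on every node (sketch only; adjust the parcel path to whatever your cluster actually uses):

export PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/bin/python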

Explorer
Posts: 14
Registered: 05-06-2014

Re: Conda env

I am not sure if what you want to achieve, using different virtual envs on the master and worker nodes, is possible yet. However, you could try to create the virtual envs at the same location on all the nodes using Ansible or Puppet. Afterwards, modify spark-env.sh; this script is executed on all nodes when a Spark job runs. So activate the desired virtual env in spark-env.sh and set the environment variable PYSPARK_PYTHON to the location of the python binary in that virtual env.
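
A rough sketch of that idea, assuming the Anaconda parcel's conda binary is available on every node and /opt/envs/projectA is a path you have provisioned everywhere (the env name, path and package list are made up for illustration):

# on every node, e.g. pushed out via Ansible or Puppet
/var/opt/cloudera/parcels/Anaconda/bin/conda create --yes --prefix /opt/envs/projectA python=2.7 pandas scikit-learn

# in spark-env.sh on every node
export PYSPARK_PYTHON=/opt/envs/projectA/bin/python

Pointing PYSPARK_PYTHON directly at the env's interpreter usually has the same effect as activating the env, at least as far as PySpark is concerned.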

 

Another alternative could be to use YARN with Docker containers. This requires some research to get working, but the theoretical idea would be to have the Spark driver and executors running in Docker containers that already include the desired Python libraries.

 

Fingers crossed ;) 
