03-23-2016 05:55 AM
03-23-2016 05:58 AM
04-21-2016 01:34 PM
If you installed the Anaconda parcel for CDH, one way to use the new Python environment is to set PYSPARK_PYTHON when submitting the job:
PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/bin/python spark-submit <script.py>
Alternatively, you can start the pyspark shell with the same interpreter:
PYSPARK_PYTHON=/var/opt/cloudera/parcels/Anaconda/bin/python pyspark
Check that the parcel installation directory is correct for your cluster; it may differ from the path above (for example, /var/opt/teradata/cloudera/parcels/Anaconda/bin/python).
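To avoid guessing the parcel directory, you could let the shell pick the first interpreter that actually exists. This is only a rough sketch: the helper name first_python is made up for illustration, and the two candidate paths are simply the ones mentioned in this thread, so adjust them to your installation.

```shell
# Hypothetical helper: print the first argument that is an executable file.
first_python() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate paths are taken from this thread; they are assumptions about your cluster.
if PYSPARK_PYTHON="$(first_python \
        /var/opt/cloudera/parcels/Anaconda/bin/python \
        /var/opt/teradata/cloudera/parcels/Anaconda/bin/python)"; then
    export PYSPARK_PYTHON
    echo "Using PYSPARK_PYTHON=$PYSPARK_PYTHON"
else
    echo "No Anaconda interpreter found at the candidate paths" >&2
fi
```

If an interpreter is found, a following spark-submit or pyspark call in the same shell session picks up the exported variable.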
If you want to change the default Python configuration for your whole cluster, follow this guide:
07-07-2016 12:26 AM
I am not sure whether what you want to achieve, using different virtual envs on the master and worker nodes, is possible yet. However, you could create virtual envs at the same location on all nodes using Ansible or Puppet. Afterwards, modify spark-env.sh, which is executed on all nodes when a Spark job runs: activate the desired virtual env there and set the PYSPARK_PYTHON environment variable to the location of python in that virtual env.
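As a rough idea of what the spark-env.sh change might look like, here is a minimal sketch. The VENV_DIR path is hypothetical; it assumes the virtual env was created at that same location on every node and that it ships the usual bin/activate script.

```shell
# Fragment for spark-env.sh.
# VENV_DIR is a hypothetical path; the virtual env must exist there on all nodes.
VENV_DIR=/opt/venvs/pyspark-env

if [ -x "$VENV_DIR/bin/python" ]; then
    # Activate the virtual env if its activate script is present.
    if [ -f "$VENV_DIR/bin/activate" ]; then
        . "$VENV_DIR/bin/activate"
    fi
    # Point PySpark at the interpreter inside the virtual env.
    export PYSPARK_PYTHON="$VENV_DIR/bin/python"
else
    echo "spark-env.sh: $VENV_DIR/bin/python not found, keeping the default Python" >&2
fi
```

The existence check keeps a node with a missing virtual env on its default Python and logs a warning, instead of pointing PYSPARK_PYTHON at a path that is not there.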
Another alternative could be to use YARN with Docker containers, although this requires some research to get working. The theoretical idea would be to run the Spark driver and executors in Docker containers that provide the desired Python libraries.
Fingers crossed ;)