I appear to be having issues running a PySpark script in a virtualenv. I have a fairly simple script:
    import findspark
    from pyspark import SparkConf
    from pyspark import SparkContext
    from os import environ
    import random  # needed by inside() below

    environ["HADOOP_CONF_DIR"] = "/etc/hadoop/3.1.0.0-78/0"
    findspark.init("/usr/hdp/current/spark2-client")

    conf = SparkConf()
    conf.setMaster('yarn-client')
    conf.set("spark.hadoop.yarn.resourcemanager.address", "HW01.co.local:8050")
    conf.setAppName('spark-yarn-demo001')
    sc = SparkContext(conf=conf)

    # Monte Carlo estimate of Pi: count random points falling inside the unit quarter-circle
    def inside(p):
        x, y = random.random(), random.random()
        return x*x + y*y < 1

    NUM_SAMPLES = 1000
    count = sc.parallelize(xrange(0, NUM_SAMPLES)) \
        .filter(inside).count()
    print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
(I know the spark.hadoop.yarn.resourcemanager.address port is usually 8032, but on my HDP 3.1 cluster the default configs have it at 8050.) I am trying to run this via spark-submit.
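For reference, a quick way to confirm the configured ResourceManager port (this assumes the standard HDP conf-dir symlink at /etc/hadoop/conf):

    # show which address/port the ResourceManager is actually configured on
    grep -A1 'yarn.resourcemanager.address' /etc/hadoop/conf/yarn-site.xml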
    (spark_demo_venv) [hdfs@HW04 tmp]$ which python
    ~/tmp/spark_demo_venv/bin/python
    (spark_demo_venv) [hdfs@HW04 tmp]$ /usr/hdp/current/spark2-client/bin/spark-submit /home/hdfs/tmp/spark_demo_001.py
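A variation I can think of (a sketch, not yet verified on my cluster) is to pin the interpreter explicitly rather than relying on whatever python is first on the PATH; spark.pyspark.python and spark.pyspark.driver.python are the standard properties for this in Spark 2.1+:

    # point both the driver and the executors at the virtualenv interpreter explicitly
    /usr/hdp/current/spark2-client/bin/spark-submit \
        --conf spark.pyspark.driver.python=/home/hdfs/tmp/spark_demo_venv/bin/python \
        --conf spark.pyspark.python=/home/hdfs/tmp/spark_demo_venv/bin/python \
        /home/hdfs/tmp/spark_demo_001.py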
Seeing errors like...
    ...
    19/08/27 14:31:57 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on HW03.ucera.local:35043 (size: 4.0 KB, free: 366.3 MB)
    19/08/27 14:31:58 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, HW03.ucera.local, executor 1): java.io.IOException: Cannot run program "/home/hdfs/tmp/spark_demo_venv/bin/python": error=13, Permission denied
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:186)
    ...
The fact that it is the virtualenv python binary that Spark has trouble executing makes me think there is something wrong with trying to run PySpark from a virtualenv (I can't test without the virtualenv, since I don't have permissions to pip install directly on this cluster node). Any other debugging suggestions or fixes for figuring this out?
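One check I plan to try (my guess: error=13 is EACCES, and the YARN executors run as the yarn user rather than hdfs, so any path component that is not world-executable, e.g. a 700 /home/hdfs, would block them from reaching the interpreter):

    # list the permissions on every component of the path to the venv python;
    # the yarn user needs execute/traverse (x) on each directory in the chain
    namei -l /home/hdfs/tmp/spark_demo_venv/bin/python

If it does turn out to be a home-directory permission issue, I have also seen the approach of packing the venv (e.g. with the venv-pack tool) and shipping it to the executors via spark-submit's --archives flag, so each container unpacks its own copy instead of reading from /home/hdfs, but I have not tried that here.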