Created 02-11-2016 07:31 PM
Hi,
I've been trying unsuccessfully to configure the pyspark interpreter on Zeppelin. I can use pyspark from the CLI and can use the Spark interpreter from Zeppelin without issue. Here are the lines which aren't commented out in my zeppelin-env.sh file:
export MASTER=yarn-client
export ZEPPELIN_PORT=8090
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.2.0-2950 -Dspark.yarn.queue=default"
export SPARK_HOME=/usr/hdp/current/spark-client/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PYSPARK_PYTHON=/usr/bin/python
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/build:$PYTHONPATH
Running a simple pyspark script in the interpreter gives this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, some_yarn_node.networkname): org.apache.spark.SparkException: Error from python worker: /usr/bin/python: No module named pyspark PYTHONPATH was: /app/hadoop/yarn/local/usercache/my_username/filecache/4121/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
I've tried adding this line to zeppelin-env.sh, which gives the same error above:
export PYTHONPATH=/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/pyspark.zip:/usr/hdp/current/spark-client/python/lib/py4j-0.8.2.1-src.zip
I've tried everything I could find on Google, any advice for debugging or fixing this problem?
Thanks,
Ian
Also, in case it's useful for debugging here are some commands and outputs below:
System.getenv().get("MASTER")
System.getenv().get("SPARK_YARN_JAR")
System.getenv().get("HADOOP_CONF_DIR")
System.getenv().get("JAVA_HOME")
System.getenv().get("SPARK_HOME")
System.getenv().get("PYSPARK_PYTHON")
System.getenv().get("PYTHONPATH")
System.getenv().get("ZEPPELIN_JAVA_OPTS")
res49: String = yarn-client
res50: String = null
res51: String = /etc/hadoop/conf
res52: String = /usr/jdk64/jdk1.7.0_45
res53: String = /usr/hdp/2.3.2.0-2950/spark
res54: String = /usr/bin/python
res55: String = /usr/hdp/2.3.2.0-2950/spark/python:/usr/hdp/2.3.2.0-2950/spark/python/build:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip:/usr/hdp/current/spark-client//python/:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:
res56: String = -Dhdp.version=2.3.2.0-2950
Created 02-25-2016 07:54 PM
There was a bug in Zeppelin, it was fixed by Mina Lee and then committed a day ago.
Created 03-17-2016 02:22 PM
Sandbox 2.4 has deployed Python 2.6.6 (no idea why) is it caused issues with the PySpark based Zeppelin demo notebooks. The way how to fix it is deploy new Python (Anaconda package etc.), add it into PATH, change PYSPARK_PYTHON in zeppelin-env.sh and also in interpreter settings in Zeppelin notebook ("python" has to be replaced by path to new python eq. /opt/anaconda2/bin/python2.7 etc.)