Can't get Pyspark interpreter to work on Zeppelin

Contributor

Hi,

I've been trying unsuccessfully to configure the pyspark interpreter on Zeppelin. I can use pyspark from the CLI and can use the Spark interpreter from Zeppelin without issue. Here are the lines which aren't commented out in my zeppelin-env.sh file:

export MASTER=yarn-client
export ZEPPELIN_PORT=8090
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.2.0-2950 -Dspark.yarn.queue=default"
export SPARK_HOME=/usr/hdp/current/spark-client/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PYSPARK_PYTHON=/usr/bin/python
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/build:$PYTHONPATH
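As a sanity check (this is just a generic sketch, not part of Zeppelin), it can help to confirm that every entry on PYTHONPATH actually exists on disk. Python silently skips nonexistent path entries, so a stale path such as ${SPARK_HOME}/python/build (which may not exist in a packaged HDP install) surfaces later as "No module named pyspark" rather than as an immediate error:

```python
import os

def missing_pythonpath_entries(pythonpath):
    """Return the PYTHONPATH entries that do not exist on disk.

    Python silently ignores nonexistent sys.path entries, so a bad
    path here shows up later as an ImportError, not as a warning.
    """
    return [entry
            for entry in pythonpath.split(os.pathsep)
            if entry and not os.path.exists(entry)]

# Example: check the current process's own environment.
print(missing_pythonpath_entries(os.environ.get("PYTHONPATH", "")))
```

Running this both from the working pyspark CLI session and from a Zeppelin paragraph can show whether the two environments differ.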

Running a simple pyspark script in the interpreter gives this error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, some_yarn_node.networkname): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /app/hadoop/yarn/local/usercache/my_username/filecache/4121/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
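The key detail in that traceback is the executor side: the YARN python worker's PYTHONPATH contained only the spark-assembly jar, so /usr/bin/python on the worker node has no way to import pyspark. Locally, bin/pyspark composes its PYTHONPATH from the python/ tree and the bundled py4j zip under SPARK_HOME; the sketch below approximates that layout (the glob is an assumption to cope with the py4j version changing between Spark releases):

```python
import glob
import os

def spark_pythonpath_entries(spark_home):
    """Approximate the PYTHONPATH entries that bin/pyspark sets up:
    the python/ source tree plus the bundled py4j source zip."""
    entries = [os.path.join(spark_home, "python")]
    # The py4j version differs between Spark releases, so glob for it.
    entries += sorted(glob.glob(
        os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))
    return entries
```

Comparing `spark_pythonpath_entries("/usr/hdp/current/spark-client")` against the PYTHONPATH the worker reports can show which piece never reaches the executor side.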

I've tried adding this line to zeppelin-env.sh, which gives the same error above:

export PYTHONPATH=/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/pyspark.zip:/usr/hdp/current/spark-client/python/lib/py4j-0.8.2.1-src.zip

I've tried everything I could find on Google. Any advice for debugging or fixing this problem?

Thanks,

Ian

Also, in case it's useful for debugging, here are some commands and their outputs:

System.getenv().get("MASTER")
res49: String = yarn-client

System.getenv().get("SPARK_YARN_JAR")
res50: String = null

System.getenv().get("HADOOP_CONF_DIR")
res51: String = /etc/hadoop/conf

System.getenv().get("JAVA_HOME")
res52: String = /usr/jdk64/jdk1.7.0_45

System.getenv().get("SPARK_HOME")
res53: String = /usr/hdp/2.3.2.0-2950/spark

System.getenv().get("PYSPARK_PYTHON")
res54: String = /usr/bin/python

System.getenv().get("PYTHONPATH")
res55: String = /usr/hdp/2.3.2.0-2950/spark/python:/usr/hdp/2.3.2.0-2950/spark/python/build:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip:/usr/hdp/current/spark-client//python/:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:

System.getenv().get("ZEPPELIN_JAVA_OPTS")
res56: String = -Dhdp.version=2.3.2.0-2950

1 ACCEPTED SOLUTION

Contributor

This turned out to be a bug in Zeppelin; it was fixed by Mina Lee and committed a day ago.


10 REPLIES

New Contributor

Sandbox 2.4 ships with Python 2.6.6 (no idea why), which caused issues with the PySpark-based Zeppelin demo notebooks. The fix is to install a newer Python (an Anaconda package, etc.), add it to PATH, change PYSPARK_PYTHON in zeppelin-env.sh, and also update the interpreter settings in the Zeppelin notebook ("python" has to be replaced with the path to the new Python, e.g. /opt/anaconda2/bin/python2.7).
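One pitfall when swapping Python builds like this is that the driver and the YARN workers can end up running different interpreters, which PySpark rejects. A small helper (just a sketch; the Anaconda path below is the hypothetical example location from the reply above, not a guaranteed install path) to check what version a given PYSPARK_PYTHON binary actually is:

```python
import subprocess

def python_version_of(executable):
    """Ask an arbitrary Python binary (e.g. the one PYSPARK_PYTHON
    points at) for its major.minor version string."""
    out = subprocess.check_output(
        [executable, "-c",
         "import sys; sys.stdout.write('%d.%d' % sys.version_info[:2])"])
    return out.decode("ascii")

# Hypothetical usage, matching the path suggested above:
# python_version_of("/opt/anaconda2/bin/python2.7")
```

Running it against both the old /usr/bin/python and the new binary confirms that zeppelin-env.sh and the interpreter settings now agree.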