
Can't get Pyspark interpreter to work on Zeppelin

Contributor

Hi,

I've been trying unsuccessfully to configure the pyspark interpreter on Zeppelin. I can use pyspark from the CLI and can use the Spark interpreter from Zeppelin without issue. Here are the lines which aren't commented out in my zeppelin-env.sh file:

export MASTER=yarn-client

export ZEPPELIN_PORT=8090

export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.2.0-2950 -Dspark.yarn.queue=default"

export SPARK_HOME=/usr/hdp/current/spark-client/

export HADOOP_CONF_DIR=/etc/hadoop/conf

export PYSPARK_PYTHON=/usr/bin/python

export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/build:$PYTHONPATH

Running a simple pyspark script in the interpreter gives this error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, some_yarn_node.networkname): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /app/hadoop/yarn/local/usercache/my_username/filecache/4121/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
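For reference, the script itself is nothing exotic; even a trivial paragraph along these lines (a hypothetical minimal example, not my exact code) fails with that error:

# Run in a %pyspark paragraph in Zeppelin; sc is the SparkContext Zeppelin provides.
rdd = sc.parallelize(range(10), 2)     # forces tasks onto the YARN python workers
print(rdd.map(lambda x: x * x).sum())  # workers fail with "No module named pyspark"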

I've tried adding this line to zeppelin-env.sh, which gives the same error above:

export PYTHONPATH=/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/pyspark.zip:/usr/hdp/current/spark-client/python/lib/py4j-0.8.2.1-src.zip
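As a sanity check outside Zeppelin (a rough sketch; the paths are the same ones as in the export above, and check_path.py is just a throwaway name), this confirms whether that PYTHONPATH lets /usr/bin/python import pyspark at all:

# Save as check_path.py and run with the same PYTHONPATH the interpreter should see:
#   PYTHONPATH=/usr/hdp/current/spark-client/python:... /usr/bin/python check_path.py
import os
import sys

print(sys.executable)                             # which interpreter is running
print(os.environ.get("PYTHONPATH", "<not set>"))  # what it actually received

import pyspark                                    # ImportError here means the path is wrong
print(pyspark.__file__)                           # where pyspark was picked up from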

I've tried everything I could find on Google. Any advice for debugging or fixing this problem?

Thanks,

Ian

Also, in case it's useful for debugging, here are the commands I ran and their outputs:

System.getenv().get("MASTER")

System.getenv().get("SPARK_YARN_JAR")

System.getenv().get("HADOOP_CONF_DIR")

System.getenv().get("JAVA_HOME")

System.getenv().get("SPARK_HOME")

System.getenv().get("PYSPARK_PYTHON")

System.getenv().get("PYTHONPATH")

System.getenv().get("ZEPPELIN_JAVA_OPTS")

res49: String = yarn-client

res50: String = null

res51: String = /etc/hadoop/conf

res52: String = /usr/jdk64/jdk1.7.0_45

res53: String = /usr/hdp/2.3.2.0-2950/spark

res54: String = /usr/bin/python

res55: String = /usr/hdp/2.3.2.0-2950/spark/python:/usr/hdp/2.3.2.0-2950/spark/python/build:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip:/usr/hdp/current/spark-client//python/:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:

res56: String = -Dhdp.version=2.3.2.0-2950

1 ACCEPTED SOLUTION

Contributor

There was a bug in Zeppelin; it was fixed by Mina Lee and committed a day ago.


10 REPLIES

Master Mentor
@Ian Maloney

https://issues.apache.org/jira/browse/SPARK-6411

https://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3CCAMJOb8kcGk0PqiOGJu6UoKCeysWCuSW...

That is because people usually don't package python files into their jars. For pyspark, however, this will work as long as the jar can be opened and its contents can be read. In my experience, if I am able to import the pyspark module by explicitly specifying the PYTHONPATH this way, then I can run pyspark on YARN without fail.
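As a rough sketch of what "explicitly specifying the PYTHONPATH" can look like from plain Python (the paths below are the HDP ones from the question; adjust them for your install):

import sys

# Point straight at the pyspark and py4j sources shipped with the Spark client.
SPARK_HOME = "/usr/hdp/current/spark-client"  # same value as in zeppelin-env.sh above
sys.path.insert(0, SPARK_HOME + "/python")
sys.path.insert(0, SPARK_HOME + "/python/lib/py4j-0.8.2.1-src.zip")

import pyspark                                # resolves from SPARK_HOME, not site-packages
print(pyspark.__file__)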

Contributor

@Neeraj Sabharwal

The Jira issue and tutorial in your comments are completely unrelated to my issue. I had previously found the link to the Apache mail archives; it's about using pyspark on YARN, which I can do via the CLI. The only problem is with Zeppelin: it ignores the PYTHONPATH in zeppelin-env.sh (the PYTHONPATH is the same as in spark-env.sh).

Contributor

@Neeraj Sabharwal

I've also tried adding the PYTHONPATH directly in the interpreter configs from the Zeppelin GUI, by creating a variable zeppelin.pyspark.pythonpath. I even tried exporting the PYTHONPATH variable from the Linux CLI. None of these worked. What bothers me is that the PYTHONPATH is not changing, and I'm always getting the same error shown above.
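One thing that may help confirm whether zeppelin-env.sh is being picked up at all (a sketch; this only shows the driver side of the interpreter, not the YARN workers) is a paragraph like:

# Run in a %pyspark paragraph; shows what the Zeppelin-launched driver actually received.
import os
import sys

print(sys.executable)                             # interpreter Zeppelin started
print(os.environ.get("PYTHONPATH", "<not set>"))  # PYTHONPATH as seen by the driver
for p in sys.path:                                # effective module search path
    print(p)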


@Ian Maloney

I'm probably hitting the same issue while trying to upgrade Python from version 2.6.6 to Anaconda3 Python 3.5. That's why I was wondering what difference changing zeppelin.pyspark.pythonpath makes if PYSPARK_PYTHON has already been changed in zeppelin-env.sh.

Also, as you mentioned, should I change the PYTHONPATH in spark-env.sh as well? I hadn't changed it before. Peter

Contributor

@Piotr Kuźmiak

What I had to do to resolve this was clone the latest Zeppelin from https://github.com/apache/incubator-zeppelin, build it with Maven, and then update my zeppelin-env.sh and put the port number I wanted in zeppelin-site.xml.

I didn't have to change anything in the Zeppelin GUI. Here is what is set in my zeppelin-env.sh:

export MASTER=yarn-client

export ZEPPELIN_PORT=8090

export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.2.0-2950 -Dspark.yarn.queue=default"

export SPARK_HOME=/usr/hdp/current/spark-client/

export HADOOP_CONF_DIR=/etc/hadoop/conf

export PYSPARK_PYTHON=/usr/bin/python

export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/build:$PYTHONPATH

Contributor

There was a bug in Zeppelin; it was fixed by Mina Lee and committed a day ago.


Grab the latest HDP 2.4 Sandbox. It comes with Spark 1.6, and the Python interpreter works in Zeppelin. Also see hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/, where the pyspark interpreter is used.

New Contributor

Sandbox 2.4 ships with Python 2.6.6 (no idea why), and it caused issues with the PySpark-based Zeppelin demo notebooks. The way to fix it is to deploy a newer Python (e.g. the Anaconda package), add it to the PATH, and change PYSPARK_PYTHON in zeppelin-env.sh and also in the interpreter settings in the Zeppelin notebook ("python" has to be replaced by the path to the new Python, e.g. /opt/anaconda2/bin/python2.7).
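After switching PYSPARK_PYTHON and restarting the interpreter, a quick way to confirm which Python the executors actually run (a sketch; /opt/anaconda2/bin/python2.7 above is just an example path) is:

# Run in a %pyspark paragraph after restarting the pyspark interpreter.
import sys

print("driver python: " + sys.executable)   # should point at the new interpreter

def worker_python(_):
    import sys
    return [sys.executable]

# Pulls the interpreter path back from a single executor task.
print("worker python: " + str(sc.parallelize([0], 1).flatMap(worker_python).collect()))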