<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Can't get Pyspark interpreter to work on Zeppelin in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138012#M19194</link>
    <description>&lt;P&gt;Sandbox 2.4 ships with Python 2.6.6 (no idea why), which causes issues with the PySpark-based Zeppelin demo notebooks. The fix is to deploy a newer Python (e.g. the Anaconda package), add it to PATH, change PYSPARK_PYTHON in zeppelin-env.sh, and also update the interpreter settings in the Zeppelin notebook ("python" has to be replaced by the path to the new Python, e.g. /opt/anaconda2/bin/python2.7).&lt;/P&gt;</description>
    <pubDate>Thu, 17 Mar 2016 21:22:47 GMT</pubDate>
    <dc:creator>jan_rock</dc:creator>
    <dc:date>2016-03-17T21:22:47Z</dc:date>
    <item>
      <title>Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138002#M19184</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I've been trying unsuccessfully to configure the pyspark interpreter on Zeppelin. I can use pyspark from the CLI and can use the Spark interpreter from Zeppelin without issue. Here are the lines which aren't commented out in my zeppelin-env.sh file:&lt;/P&gt;&lt;PRE&gt;export MASTER=yarn-client
export ZEPPELIN_PORT=8090
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.2.0-2950 -Dspark.yarn.queue=default"
export SPARK_HOME=/usr/hdp/current/spark-client/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PYSPARK_PYTHON=/usr/bin/python
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/build:$PYTHONPATH&lt;/PRE&gt;&lt;P&gt;Running a simple pyspark script in the interpreter gives this error:&lt;/P&gt;&lt;PRE&gt;Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, some_yarn_node.networkname): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /app/hadoop/yarn/local/usercache/my_username/filecache/4121/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar&lt;/PRE&gt;&lt;P&gt;I've tried adding this line to zeppelin-env.sh, which gives the same error as above:&lt;/P&gt;&lt;PRE&gt;export PYTHONPATH=/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/pyspark.zip:/usr/hdp/current/spark-client/python/lib/py4j-0.8.2.1-src.zip&lt;/PRE&gt;&lt;P&gt;I've tried everything I could find on Google; any advice for debugging or fixing this problem?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Ian&lt;/P&gt;&lt;P&gt;Also, in case it's useful for debugging, here are some commands and their outputs:&lt;/P&gt;&lt;PRE&gt;System.getenv().get("MASTER")
res49: String = yarn-client
System.getenv().get("SPARK_YARN_JAR")
res50: String = null
System.getenv().get("HADOOP_CONF_DIR")
res51: String = /etc/hadoop/conf
System.getenv().get("JAVA_HOME")
res52: String = /usr/jdk64/jdk1.7.0_45
System.getenv().get("SPARK_HOME")
res53: String = /usr/hdp/2.3.2.0-2950/spark
System.getenv().get("PYSPARK_PYTHON")
res54: String = /usr/bin/python
System.getenv().get("PYTHONPATH")
res55: String = /usr/hdp/2.3.2.0-2950/spark/python:/usr/hdp/2.3.2.0-2950/spark/python/build:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip:/usr/hdp/current/spark-client//python/:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:
System.getenv().get("ZEPPELIN_JAVA_OPTS")
res56: String = -Dhdp.version=2.3.2.0-2950&lt;/PRE&gt;</description>
      <pubDate>Fri, 12 Feb 2016 03:31:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138002#M19184</guid>
      <dc:creator>rachmaninovquar</dc:creator>
      <dc:date>2016-02-12T03:31:45Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138003#M19185</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2745/rachmaninovquartet.html" nodeid="2745"&gt;@Ian Maloney&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/SPARK-6411" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-6411&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3CCAMJOb8kcGk0PqiOGJu6UoKCeysWCuSW3xwd5wRs8ikpMgD2DAg@mail.gmail.com%3E" target="_blank"&gt;https://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3CCAMJOb8kcGk0PqiOGJu6UoKCeysWCuSW3xwd5wRs8ikpMgD2DAg@mail.gmail.com%3E&lt;/A&gt;&lt;/P&gt;&lt;P&gt;That is because people usually don't package python files into their jars. For pyspark, however, this will work as long as the jar can be opened and its contents can be read. In my experience, if I am able to import the pyspark module by explicitly specifying the PYTHONPATH this way, then I can run pyspark on YARN without fail.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Feb 2016 06:00:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138003#M19185</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-02-12T06:00:44Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138004#M19186</link>
      <description>&lt;P&gt;See this tutorial &lt;A href="http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/" target="_blank"&gt;http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 12 Feb 2016 07:00:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138004#M19186</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-02-12T07:00:01Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138005#M19187</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/140/nsabharwal.html" nodeid="140"&gt;@Neeraj Sabharwal&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The Jira issue and tutorial in your comments are completely unrelated to my issue. I had previously found the link to the Apache mail archives; it's about using pyspark on YARN, which I can already do via the CLI. The only problem is with Zeppelin: it ignores the pythonpath in zeppelin-env.sh (the pythonpath is the same as in spark-env.sh).&lt;/P&gt;</description>
      <pubDate>Fri, 12 Feb 2016 23:43:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138005#M19187</guid>
      <dc:creator>rachmaninovquar</dc:creator>
      <dc:date>2016-02-12T23:43:00Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138006#M19188</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/140/nsabharwal.html" nodeid="140"&gt;@Neeraj Sabharwal&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I've also tried adding the pythonpath directly in the interpreter configs from the Zeppelin GUI, by creating a variable zeppelin.pyspark.pythonpath. I even tried exporting the PYTHONPATH variable from the Linux CLI. None of these worked. What bothers me is that the pythonpath is not changing, and I'm always getting the same error shown above.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Feb 2016 23:43:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138006#M19188</guid>
      <dc:creator>rachmaninovquar</dc:creator>
      <dc:date>2016-02-12T23:43:18Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138007#M19189</link>
      <description>&lt;P&gt;There was a bug in Zeppelin; it was fixed by Mina Lee and committed a day ago.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Feb 2016 03:54:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138007#M19189</guid>
      <dc:creator>rachmaninovquar</dc:creator>
      <dc:date>2016-02-26T03:54:52Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138008#M19190</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2745/rachmaninovquartet.html" nodeid="2745"&gt;@Ian Maloney&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I probably suffer from the same issue; I am trying to upgrade Python from version 2.6.6 to Anaconda3 Python 3.5. This is why I wondered what changing zeppelin.pyspark.pythonpath adds if PYSPARK_PYTHON was already changed in zeppelin-env.sh.&lt;/P&gt;&lt;P&gt;Also, as you mentioned, should I change the pythonpath in spark-env.sh as well? I did not change it before.&lt;/P&gt;&lt;P&gt;Peter&lt;/P&gt;</description>
      <pubDate>Fri, 26 Feb 2016 19:29:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138008#M19190</guid>
      <dc:creator>piotr_kuzmiak</dc:creator>
      <dc:date>2016-02-26T19:29:28Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138009#M19191</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3039/piotrkuzmiak.html" nodeid="3039"&gt;@Piotr Kuźmiak&lt;/A&gt;&lt;/P&gt;&lt;P&gt;What I had to do to resolve this was clone the latest Zeppelin from &lt;A href="https://github.com/apache/incubator-zeppelin"&gt;https://github.com/apache/incubator-zeppelin&lt;/A&gt;, build it with Maven, then update my zeppelin-env.sh and set the port number I wanted in zeppelin-site.xml.&lt;/P&gt;&lt;P&gt;I didn't have to change anything in the Zeppelin GUI. Here is what is set in my zeppelin-env.sh:&lt;/P&gt;&lt;PRE&gt;export MASTER=yarn-client
export ZEPPELIN_PORT=8090
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.2.0-2950 -Dspark.yarn.queue=default"
export SPARK_HOME=/usr/hdp/current/spark-client/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PYSPARK_PYTHON=/usr/bin/python
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/build:$PYTHONPATH&lt;/PRE&gt;</description>
      <pubDate>Fri, 26 Feb 2016 23:08:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138009#M19191</guid>
      <dc:creator>rachmaninovquar</dc:creator>
      <dc:date>2016-02-26T23:08:12Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138010#M19192</link>
      <description>&lt;P&gt;Grab the latest HDP 2.4 Sandbox. It comes with Spark 1.6, and the python interpreter works in Zeppelin.&lt;/P&gt;&lt;P&gt;Also, see hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/, where the pyspark interpreter is used.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Mar 2016 11:14:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138010#M19192</guid>
      <dc:creator>rhryniewicz</dc:creator>
      <dc:date>2016-03-03T11:14:35Z</dc:date>
    </item>
    <item>
      <title>Re: Can't get Pyspark interpreter to work on Zeppelin</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138011#M19193</link>
      <description>&lt;P&gt;Sandbox 2.4 ships with Python 2.6.6 (no idea why), which causes issues with the PySpark-based Zeppelin demo notebooks. The fix is to deploy a newer Python (e.g. the Anaconda package), add it to PATH, change PYSPARK_PYTHON in zeppelin-env.sh, and also update the interpreter settings in the Zeppelin notebook ("python" has to be replaced by the path to the new Python, e.g. /opt/anaconda2/bin/python2.7).&lt;/P&gt;</description>
      <pubDate>Thu, 17 Mar 2016 21:22:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-get-Pyspark-interpreter-to-work-on-Zeppelin/m-p/138011#M19193</guid>
      <dc:creator>jan_rock</dc:creator>
      <dc:date>2016-03-17T21:22:32Z</dc:date>
    </item>
  </channel>
</rss>

