Created 02-11-2016 07:31 PM
Hi,
I've been trying unsuccessfully to configure the pyspark interpreter on Zeppelin. I can use pyspark from the CLI and can use the Spark interpreter from Zeppelin without issue. Here are the lines which aren't commented out in my zeppelin-env.sh file:
export MASTER=yarn-client
export ZEPPELIN_PORT=8090
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.2.0-2950 -Dspark.yarn.queue=default"
export SPARK_HOME=/usr/hdp/current/spark-client/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PYSPARK_PYTHON=/usr/bin/python
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/build:$PYTHONPATH
Running a simple pyspark script in the interpreter gives this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, some_yarn_node.networkname): org.apache.spark.SparkException:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /app/hadoop/yarn/local/usercache/my_username/filecache/4121/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
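For reference, even a trivial paragraph like the one below reproduces it for me (my actual script was different, but anything that launches Python tasks on the executors hits the same error):
%pyspark
# Any action that ships work to the executors triggers the "No module named
# pyspark" failure, since the worker-side PYTHONPATH only contains the assembly jar.
print(sc.parallelize(range(100)).map(lambda x: x * 2).sum())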
I've tried adding this line to zeppelin-env.sh, which gives the same error as above:
export PYTHONPATH=/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/pyspark.zip:/usr/hdp/current/spark-client/python/lib/py4j-0.8.2.1-src.zip
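One thing I also checked from a plain Python shell on the Zeppelin node is whether those paths actually exist for the Python that PYSPARK_PYTHON points to (a minimal sketch, using the paths from the export above):
# Sanity check: do the pyspark/py4j paths referenced in zeppelin-env.sh exist on this node?
import os
paths = [
    "/usr/hdp/current/spark-client/python",
    "/usr/hdp/current/spark-client/python/lib/pyspark.zip",
    "/usr/hdp/current/spark-client/python/lib/py4j-0.8.2.1-src.zip",
]
for p in paths:
    print(p, "->", "exists" if os.path.exists(p) else "MISSING")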
I've tried everything I could find on Google, any advice for debugging or fixing this problem?
Thanks,
Ian
Also, in case it's useful for debugging, here are some commands and their outputs:
System.getenv().get("MASTER")
System.getenv().get("SPARK_YARN_JAR")
System.getenv().get("HADOOP_CONF_DIR")
System.getenv().get("JAVA_HOME")
System.getenv().get("SPARK_HOME")
System.getenv().get("PYSPARK_PYTHON")
System.getenv().get("PYTHONPATH")
System.getenv().get("ZEPPELIN_JAVA_OPTS")
res49: String = yarn-client
res50: String = null
res51: String = /etc/hadoop/conf
res52: String = /usr/jdk64/jdk1.7.0_45
res53: String = /usr/hdp/2.3.2.0-2950/spark
res54: String = /usr/bin/python
res55: String = /usr/hdp/2.3.2.0-2950/spark/python:/usr/hdp/2.3.2.0-2950/spark/python/build:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip:/usr/hdp/current/spark-client//python/:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:/usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/build:
res56: String = -Dhdp.version=2.3.2.0-2950
Created 02-11-2016 10:00 PM
https://issues.apache.org/jira/browse/SPARK-6411
That is because people usually don't package Python files into their jars. For pyspark, however, this works as long as the jar can be opened and its contents can be read. In my experience, if I can import the pyspark module by explicitly specifying the PYTHONPATH this way, then I can run pyspark on YARN without problems.
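If it helps, one way to test whether Python can actually read pyspark out of a given jar is to put it on sys.path directly (a rough sketch; the jar path is just the one from the error message above):
# Rough check: can this Python import pyspark straight out of the assembly jar,
# the way the YARN workers are asked to?
import sys
jar = ("/app/hadoop/yarn/local/usercache/my_username/filecache/4121/"
       "spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar")
sys.path.insert(0, jar)
try:
    import pyspark
    print("pyspark importable from jar:", pyspark.__file__)
except ImportError as e:
    print("jar does not expose pyspark:", e)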
Created 02-11-2016 11:00 PM
See this tutorial http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/
Created 02-12-2016 03:43 PM
The JIRA issue and tutorial in your comments are completely unrelated to my issue; I had already come across that link in the Apache mail archives. It's about using pyspark on YARN, which I can already do via the CLI. The only problem is with Zeppelin: it ignores the PYTHONPATH set in zeppelin-env.sh (which is the same as the one in spark-env.sh).
Created 02-12-2016 03:43 PM
I've also tried adding the PYTHONPATH directly in the interpreter configuration from the Zeppelin GUI by creating a variable zeppelin.pyspark.pythonpath, and I even tried exporting the PYTHONPATH variable from the Linux CLI. None of these worked. What bothers me is that the PYTHONPATH is not changing, and I'm always getting the same error shown above.
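In case it helps anyone debugging the same thing, a paragraph along these lines (hypothetical, it only uses the standard os/sys modules) shows what the pyspark interpreter process itself actually received, which is a quick way to tell whether zeppelin-env.sh is being picked up at all:
%pyspark
# Print what the interpreter process actually sees; if PYTHONPATH here already
# looks right, the failure is on the YARN worker side, not in zeppelin-env.sh.
import os, sys
print("PYTHONPATH:", os.environ.get("PYTHONPATH"))
print("python:", sys.executable)
print("sys.path head:", sys.path[:5])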
Created 02-26-2016 11:29 AM
I'm probably suffering from the same thing while trying to upgrade Python from version 2.6.6 to Anaconda3 Python 3.5. That's why I wondered what the difference is between changing zeppelin.pyspark.pythonpath and changing PYSPARK_PYTHON, which I had already done in zeppelin-env.sh.
Also, as you mentioned, should I change the PYTHONPATH in spark-env.sh as well? I hadn't changed it before. Peter
Created 02-26-2016 03:08 PM
What I had to do to resolve this was clone the latest Zeppelin from https://github.com/apache/incubator-zeppelin, build it with Maven, update my zeppelin-env.sh, and put the port number I wanted in zeppelin-site.xml.
I didn't have to change anything in the Zeppelin GUI. Here is what is set in my zeppelin-env.sh:
export MASTER=yarn-client
export ZEPPELIN_PORT=8090
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.2.0-2950 -Dspark.yarn.queue=default"
export SPARK_HOME=/usr/hdp/current/spark-client/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export PYSPARK_PYTHON=/usr/bin/python
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/build:$PYTHONPATH
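After restarting Zeppelin I verified it with a simple paragraph along these lines (just an illustration, not the exact notebook I used):
%pyspark
# Confirm the driver can import pyspark and that the YARN executors can start
# Python workers again (the step that used to fail).
import sys, pyspark
print(sys.version)
print(pyspark.__file__)
print(sc.parallelize(range(10)).count())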
Created 02-25-2016 07:54 PM
There was a bug in Zeppelin; it was fixed by Mina Lee and committed a day ago.
Created 03-03-2016 03:14 AM
Grab the latest HDP 2.4 Sandbox. It comes with Spark 1.6, and the pyspark interpreter works in Zeppelin. Also see hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/ where the pyspark interpreter is used.
Created 03-17-2016 02:22 PM
Sandbox 2.4 ships with Python 2.6.6 (no idea why), and that caused issues with the PySpark-based Zeppelin demo notebooks. The way to fix it is to deploy a newer Python (an Anaconda package etc.), add it to the PATH, change PYSPARK_PYTHON in zeppelin-env.sh, and also change it in the interpreter settings in the Zeppelin notebook ("python" has to be replaced by the path to the new Python, e.g. /opt/anaconda2/bin/python2.7).
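A quick way to confirm the switch took effect is a paragraph like this (illustrative; it just prints which interpreter the driver and the YARN workers are actually running):
%pyspark
# Check that both the driver and the executors picked up the new Python
# (they should point at the Anaconda binary, not /usr/bin/python).
import sys
print("driver:", sys.executable, sys.version_info[:3])
print("worker:", sc.parallelize([0]).map(lambda _: sys.executable).first())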