In a CDP environment that contains both Spark2 and Spark3, Jupyter Notebook uses the default path provided in its kernel builds, which in this case points to Spark2. We tried adding a new kernel to Jupyter using the JSON file below, with Spark3 copied over to the CDH parcel directory, but it did not work.


cat /data1/python3.6.10/share/jupyter/kernels/pyspark3/kernel.json
 "argv": [
 "display_name": "PySpark3",
 "language": "python",
 "env": {
  "JAVA_HOME": "/usr/java/latest",
  "PYSPARK_PYTHON": "/data1/python3.6.10/bin/python3.6",
  "SPARK_HOME": "/opt/cloudera/parcels/CDH/lib/spark3",
  "HADOOP_CONF_DIR": "/opt/cloudera/parcels/CDH/lib/spark3/conf/yarn-conf",
  "SPARK_CONF_DIR": "/opt/cloudera/parcels/CDH/lib/spark3/conf",
  "PYTHONPATH": "/opt/cloudera/parcels/CDH/lib/spark3/python/lib/",
  "PATH": "$SPARK_HOME/bin:$JAVA_HOME/bin:$PATH",
  "PYTHON_STARTUP": "/opt/cloudera/parcels/CDH/lib/spark3/python/pyspark/",
  "CLASSPATH": "/opt/cloudera/parcels/CDH/lib/spark3/conf/yarn-conf",
  "PYSPARK_SUBMIT_ARGS": " --py-files '/etc/hive/conf/hive-site.xml' --master yarn --name 'Jupyter Notebook' --conf spark.jars.ivy=/tmp/.ivy --queue user_prod  pyspark-shell --jars /tmp/ojdbc8.jar"
 }
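
When a custom kernel like this does not start, it can help to rule out a malformed spec first. The following is only a minimal sketch, not a Cloudera or Jupyter tool: `check_kernel_spec` is a hypothetical helper name, and the list of variables it checks is an assumption based on the file above. It confirms that the file parses as JSON, that `argv` and `display_name` are present, and that the directory variables exist on the host. It also flags options placed after `pyspark-shell` in `PYSPARK_SUBMIT_ARGS`, since to our understanding spark-submit treats anything after it as application arguments rather than as its own options.

```python
import json
import os
import shlex


def check_kernel_spec(path):
    """Return a list of problems found in a Jupyter kernel.json file."""
    try:
        with open(path) as fh:
            spec = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return ["cannot load {}: {}".format(path, exc)]
    if not isinstance(spec, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    # A kernel spec needs argv (the launch command) and a display name.
    for key in ("argv", "display_name"):
        if not spec.get(key):
            problems.append("missing or empty key: " + key)

    env = spec.get("env", {})
    # Assumption: these variables should name directories that exist locally.
    for name in ("JAVA_HOME", "SPARK_HOME", "SPARK_CONF_DIR", "HADOOP_CONF_DIR"):
        value = env.get(name)
        if value and not os.path.isdir(value):
            problems.append("{} is not a directory: {}".format(name, value))

    # Options after the trailing "pyspark-shell" are not read as submit options.
    submit_args = shlex.split(env.get("PYSPARK_SUBMIT_ARGS", ""))
    if submit_args and submit_args[-1] != "pyspark-shell":
        problems.append("PYSPARK_SUBMIT_ARGS should end with pyspark-shell")
    return problems
```

Calling `check_kernel_spec("/data1/python3.6.10/share/jupyter/kernels/pyspark3/kernel.json")` on the Jupyter host returns an empty list when none of these checks fail. An empty list does not prove the kernel will work, only that these particular mistakes are absent.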

The customer was able to run Spark3 jobs in Jupyter by adding the following Python code before the rest of the script:


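import os
import sys

# Point this notebook at the Spark3 parcel instead of the default Spark2.
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK3/lib/spark3"
# Directory that holds the Python libraries bundled with Spark3.
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# Python interpreter used by the Spark driver.
os.environ["PYSPARK_DRIVER_PYTHON"] = "/data1/python3.6.10/bin/python3"
# Put the Spark3 Python library path ahead of any Spark2 entries.
sys.path.insert(0, os.environ["PYLIB"] + "/")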
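
The snippet above adds the `python/lib` directory itself to `sys.path`. Spark distributions normally ship their Python libraries in that directory as zip archives (`pyspark.zip` and a `py4j-<version>-src.zip`), so a variant that adds each archive may be needed for `import pyspark` to resolve. The sketch below is one hedged way to do that without hard-coding the py4j version. `add_spark_python_libs` is a hypothetical helper name, and the archive layout is an assumption that should be checked against the actual parcel.

```python
import glob
import os
import sys


def add_spark_python_libs(spark_home):
    """Prepend every zip archive under <spark_home>/python/lib to sys.path."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    # Assumption: the libraries are shipped as pyspark.zip and py4j-*-src.zip.
    archives = sorted(glob.glob(os.path.join(lib_dir, "*.zip")))
    for archive in archives:
        if archive not in sys.path:
            sys.path.insert(0, archive)
    return archives


if __name__ == "__main__":
    # Same SPARK_HOME as in the customer's snippet.
    os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK3/lib/spark3"
    found = add_spark_python_libs(os.environ["SPARK_HOME"])
    print("Added to sys.path:", found)
```

If the returned list is empty, the parcel path or the archive layout differs from what is assumed here, and the directory contents should be inspected before going further.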

This approach is useful where the .bashrc file cannot be modified. Jupyter Notebook continues to use Spark2 by default and uses Spark3 only in notebooks where the above code is inserted.


10-14-2022 06:18 AM
