
Jupyter with pyspark is not elastic on YARN

Expert Contributor

I have installed JupyterHub and the Jupyter Notebook, integrated them with LDAP, and created a PySpark kernel: jupyter-pyspark-kernel.txt

This setup works well, but when the PySpark shell is launched from Jupyter it holds 3 containers and 23 vcores from YARN.


Whatever job we run executes within these containers. It does not use the remaining cluster resources even when they are available, and it does not release the resources when no jobs are running.

We do have spark dynamic allocation enabled:

"PYSPARK_SUBMIT_ARGS" : "--master yarn --conf spark.dynamicAllocation.enabled=true --conf spark.driver.memory=50G --conf spark.dynamicAllocation.initialExecutors=1 --conf spark.dynamicAllocation.maxExecutors=40 --conf spark.dynamicAllocation.minExecutors=1 --conf spark.executor.heartbeatInterval=600s --conf spark.executor.memory=50G --conf spark.kryoserializer.buffer=64k --conf spark.kryoserializer.buffer.max=64m --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.service.enabled=true --conf spark.sql.broadcastTimeout=1800 --conf spark.yarn.driver.memoryOverhead=3072 --conf spark.yarn.executor.memoryOverhead=3072 --conf spark.yarn.queue=data-science-queue pyspark-shell"

Is there a way to make Jupyter pass the right parameters to PySpark so that Spark's dynamic allocation on YARN works properly?
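For reference, a minimal sketch of how the submit args can be assembled in Python before pyspark is first imported (for example in a kernel startup script). The spark.dynamicAllocation.* names are standard Spark properties; the executor counts and queue name here are just illustrative, and `pyspark-shell` (the application resource) must be the last token:

```python
import os

# Illustrative dynamic-allocation settings; tune min/max for your queue.
confs = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.initialExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "40",
    # The external shuffle service is required for dynamic allocation on YARN.
    "spark.shuffle.service.enabled": "true",
    "spark.yarn.queue": "data-science-queue",
}

flags = " ".join(f"--conf {k}={v}" for k, v in confs.items())
# The application resource (pyspark-shell) must come last.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--master yarn {flags} pyspark-shell"

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

This must run before the notebook creates a SparkContext, since the environment variable is only read when the shell is launched.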




Re: Jupyter with pyspark is not elastic on YARN


Can you share the YARN queue configs for "data-science-queue"?

To debug, can you set the PySpark args in your Python code, increasing the number of initial executors, and see if the job consumes more than 23 vcores?

import os

# Must be set before pyspark is first imported in the notebook.
memory = '4g'
pyspark_submit_args = '--driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
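Extending that idea, a sketch that raises spark.dynamicAllocation.initialExecutors for the test (the property names are standard Spark; the value 10 is arbitrary, chosen only to exceed the 23-vcore footprint you observed):

```python
import os

# If YARN still caps the app at 23 vcores with a higher initial executor
# count, the queue limits or resource calculator are likely the constraint.
initial_executors = 10
pyspark_submit_args = (
    "--master yarn "
    "--conf spark.dynamicAllocation.enabled=true "
    f"--conf spark.dynamicAllocation.initialExecutors={initial_executors} "
    "--conf spark.shuffle.service.enabled=true "
    "pyspark-shell"  # the application resource must be the last token
)
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
```

As with the snippet above, this only takes effect if it runs before the SparkContext is created.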

Re: Jupyter with pyspark is not elastic on YARN

Expert Contributor

Thanks for your reply. It holds the containers and vcores as soon as Jupyter launches the PySpark kernel, i.e. it retains the resources even without any operations running.
