Posts: 69
Registered: ‎01-24-2017

Using Hive from Spark 2.0.0 in JupyterHub, CDH

Hi All,


I am having trouble using Hive from Spark 2.0 on JupyterHub.


It works in pyspark2 or spark2-submit but not in JupyterHub.


It also works in JupyterHub with the older Spark 1.6 (and I do not recall having to do anything special to make it work).


Apparently some environment variable is missing from the JupyterHub Spark 2.0 kernel:


$ cat /usr/local/share/jupyter/kernels/pyspark2/kernel.json
"display_name": "pySpark (Spark 2.0.0)",
"language": "python",
"argv": [
"env": {
 "PYSPARK_PYTHON": "/usr/bin/python",
 "SPARK_HOME": "/opt/cloudera/parcels/SPARK2/lib/spark2",
 "HADOOP_CONF_DIR": "/etc/hadoop/conf",
 "PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/",
 "PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/",
 "PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"


Is there a way to set the path to the metastore manually from inside a shell? What parameter controls it? I think it is trying to look in $HOME.


The command that I am trying to execute is:


sqlCtx.sql("show tables").show()


It returns the expected list of tables when run inside the pyspark2 shell but returns an empty list in JupyterHub.
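
One way to narrow this down is to check, from a notebook cell, which configuration directories the kernel process actually sees and whether any of them contains hive-site.xml. The snippet below is only a sketch: the helper name hive_site_candidates is made up for illustration, and it assumes Spark picks up hive-site.xml from HADOOP_CONF_DIR or from $SPARK_HOME/conf. When no hive-site.xml is found, Spark typically falls back to a local Derby metastore, which would explain an empty table list.

```python
import os


def hive_site_candidates(environ):
    """Return (directory, has_hive_site) pairs for the usual config locations.

    Hypothetical helper: it only inspects HADOOP_CONF_DIR and $SPARK_HOME/conf.
    """
    dirs = []
    if environ.get("HADOOP_CONF_DIR"):
        dirs.append(environ["HADOOP_CONF_DIR"])
    if environ.get("SPARK_HOME"):
        dirs.append(os.path.join(environ["SPARK_HOME"], "conf"))
    # A directory counts as usable only if hive-site.xml is a regular file in it.
    return [(d, os.path.isfile(os.path.join(d, "hive-site.xml"))) for d in dirs]


# Run inside the notebook to see what the kernel process was started with.
for directory, found in hive_site_candidates(os.environ):
    print(directory, "hive-site.xml found" if found else "no hive-site.xml")
```

Running the same cell in the pyspark2 shell and in the JupyterHub kernel should show where the two environments differ.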


Thank you,




Re: Using Hive from Spark 2.0.0 in JupyterHub, CDH

I solved this problem. I looked at the environment set in 


and set the corresponding variables in the JupyterHub kernel. In particular:


 "HADOOP_CLIENT_OPTS":"-Xmx2147483648 -XX:MaxPermSize=512M",

I think HADOOP_CONF_DIR is the most important one, because previously I had it set to a different directory that does not contain hive-site.xml.
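
For anyone hitting the same problem, the check described above can be sketched in a few lines of Python. This is not part of the original fix: the function name check_kernel_conf_dir is hypothetical, and it assumes the kernel spec is a complete, valid JSON file with an "env" block like the one in the first post.

```python
import json
import os


def check_kernel_conf_dir(kernel_json_path):
    """Return (conf_dir, has_hive_site) for HADOOP_CONF_DIR in a kernel spec.

    Hypothetical helper: returns (None, False) if the variable is not set.
    """
    with open(kernel_json_path) as handle:
        spec = json.load(handle)
    conf_dir = spec.get("env", {}).get("HADOOP_CONF_DIR")
    if conf_dir is None:
        return None, False
    # The directory is only useful to Spark's Hive support if it holds hive-site.xml.
    return conf_dir, os.path.isfile(os.path.join(conf_dir, "hive-site.xml"))


if __name__ == "__main__":
    # Path taken from the first post; adjust it for your installation.
    path = "/usr/local/share/jupyter/kernels/pyspark2/kernel.json"
    if os.path.exists(path):
        print(check_kernel_conf_dir(path))
```

If the second value is False, the kernel is pointing at a directory without hive-site.xml, which matches the situation I had before the fix.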