
Using Hive from Spark 2.0.0 in JupyterHub, CDH 5.10.0.1

Hi All,

 

I am having trouble using Hive from Spark 2.0 on JupyterHub.

 

It works from the pyspark2 shell or with spark2-submit, but not from JupyterHub.

 

It works in JupyterHub with the older Spark 1.6 (and I do not remember having to do anything special to make it work).

 

Apparently some environment variable is missing from the JupyterHub Spark 2.0 kernel:

=========

$ cat /usr/local/share/jupyter/kernels/pyspark2/kernel.json
{
"display_name": "pySpark (Spark 2.0.0)",
"language": "python",
"argv": [
 "/usr/bin/python",
 "-m",
 "ipykernel",
 "-f",
 "{connection_file}"
],
"env": {
 "PYSPARK_PYTHON": "/usr/bin/python",
 "SPARK_HOME": "/opt/cloudera/parcels/SPARK2/lib/spark2",
 "HADOOP_CONF_DIR": "/etc/hadoop/conf",
 "PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.3-src.zip:/opt/cloudera/parcels/SPARK2/lib/spark2/python/",
 "PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
 "PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"
}
}

=========

Is there a way to set the path to the metastore manually from inside the shell? Which parameter controls it? I think it is looking in $HOME.
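
From what I can tell, the relevant settings are hive.metastore.uris (normally read from hive-site.xml) and spark.sql.warehouse.dir; if Spark cannot find hive-site.xml it seems to fall back to a local Derby metastore in the notebook's working directory, which would explain the empty result. Something like this is what I mean by setting it manually (a sketch only; the metastore host and port below are placeholders, not values from this cluster):

=========

from pyspark.sql import SparkSession

# Sketch: point Spark SQL at the cluster's Hive metastore explicitly.
# thrift://metastore-host.example.com:9083 is a placeholder; the real value
# normally comes from hive-site.xml under HADOOP_CONF_DIR.
spark = SparkSession.builder \
    .appName("jupyterhub-hive-test") \
    .config("hive.metastore.uris", "thrift://metastore-host.example.com:9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("show tables").show()

=========

Note that the kernel above already starts a session through PYTHONSTARTUP (shell.py), so settings like these only take effect if they are applied before that first session is created.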

 

The command that I am trying to execute is:

=========

sqlCtx.sql("show tables").show()

=========

It returns the expected list of tables when run inside the pyspark2 shell but returns an empty list in JupyterHub.
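
One way to see which catalog the notebook session actually ends up with (a quick check from inside the kernel, using the spark session that shell.py creates):

=========

# Returns 'hive' when Hive support is active; 'in-memory' means Spark never
# found hive-site.xml and "show tables" will come back empty.
spark.conf.get("spark.sql.catalogImplementation")

# Where Spark is looking for the warehouse; without hive-site.xml this
# typically points at a local spark-warehouse directory, e.g. under $HOME.
spark.conf.get("spark.sql.warehouse.dir")

=========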

 

Thank you,

Igor

 

 


Re: Using Hive from Spark 2.0.0 in JupyterHub, CDH 5.10.0.1

I solved this problem. I looked at the environment set in 

/etc/spark2/conf/yarn-conf/hive-env.sh

and set the corresponding variables in the JupyterHub kernel. In particular:

 

 "HADOOP_CONF_DIR":"/etc/spark2/conf/yarn-conf",
 "HIVE_AUX_JARS_PATH":"/usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar",
 "HADOOP_CLIENT_OPTS":"-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true",

I think HADOOP_CONF_DIR is the most important one, because previously I had it set to a different directory that does not contain hive-site.xml.
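
For completeness, the env section of the working kernel.json then looks roughly like this (the original kernel.json from the first post merged with the variables above; the paths are specific to this CDH parcel layout):

=========

"env": {
 "PYSPARK_PYTHON": "/usr/bin/python",
 "SPARK_HOME": "/opt/cloudera/parcels/SPARK2/lib/spark2",
 "HADOOP_CONF_DIR": "/etc/spark2/conf/yarn-conf",
 "HIVE_AUX_JARS_PATH": "/usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar",
 "HADOOP_CLIENT_OPTS": "-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true",
 "PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.3-src.zip:/opt/cloudera/parcels/SPARK2/lib/spark2/python/",
 "PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
 "PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"
}

=========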


Re: Using Hive from Spark 2.0.0 in JupyterHub, CDH 5.10.0.1

Hi Igor,

Thanks for your post!

It helped me set up a pyspark kernel for remote cluster connectivity; our remote cluster is running CDH 5.14.4.

Ping