Created 04-28-2017 09:51 PM
Hi All,
I am having trouble using Hive from Spark 2.0 on JupyterHub.
It works in pyspark2 or spark2-submit but not in JupyterHub.
It works in JupyterHub with the older Spark 1.6 kernel (and I do not remember having to do anything special to make it work).
Apparently some environment variable is missing from the JupyterHub Spark 2.0 kernel:
=========
$ cat /usr/local/share/jupyter/kernels/pyspark2/kernel.json
{
"display_name": "pySpark (Spark 2.0.0)",
"language": "python",
"argv": [
"/usr/bin/python",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"PYSPARK_PYTHON": "/usr/bin/python",
"SPARK_HOME": "/opt/cloudera/parcels/SPARK2/lib/spark2",
"HADOOP_CONF_DIR": "/etc/hadoop/conf",
"PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.3-src.zip:/opt/cloudera/parcels/SPARK2/lib/spark2/python/",
"PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"
}
}
=========
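For reference, a quick sanity check from inside the notebook shows whether the session was actually built with Hive support and which Hadoop config directory the kernel exported (this assumes shell.py has created the session as spark, alongside sqlCtx):
=========
import os

# "hive" means Hive support is enabled; "in-memory" is the default catalog
print(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))

# Where Spark thinks the warehouse lives
print(spark.conf.get("spark.sql.warehouse.dir", "<unset>"))

# The Hadoop config directory this Python process sees
print(os.environ.get("HADOOP_CONF_DIR"))
=========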
Is there a way to set the path to the metastore manually from inside the shell? Which parameter controls it? I think it is trying to look in $HOME.
The command that I am trying to execute is:
=========
sqlCtx.sql("show tables").show()
=========
It returns the expected list of tables when used inside the pyspark2 shell but returns an empty list in JupyterHub.
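Something like the sketch below is what I had in mind for setting it manually when building the session (the metastore thrift URI here is just a placeholder, and if shell.py has already created a session, getOrCreate() would return that one instead), but I would rather understand which setting the JupyterHub kernel is missing:
=========
from pyspark.sql import SparkSession

# Build a Hive-enabled session and point it at the metastore explicitly,
# instead of relying on hive-site.xml being found via HADOOP_CONF_DIR.
spark = (SparkSession.builder
         .appName("jupyter-hive-test")
         .config("hive.metastore.uris", "thrift://<metastore-host>:9083")  # placeholder
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show tables").show()
=========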
Thank you,
Igor
Created 04-30-2017 10:27 AM
I solved this problem. I looked at the environment set in
/etc/spark2/conf/yarn-conf/hive-env.sh
and set the corresponding variables in the JupyterHub kernel. In particular:
"HADOOP_CONF_DIR":"/etc/spark2/conf/yarn-conf",
"HIVE_AUX_JARS_PATH":"/usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar",
"HADOOP_CLIENT_OPTS":"-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true",
I think HADOOP_CONF_DIR is the most important one, because previously I had it pointed at a different directory that does not contain hive-site.xml.
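Putting it together, the "env" section of the kernel.json now looks roughly like this (the parcel, driver, and config paths are the ones from my setup above; yours may differ):
=========
"env": {
"PYSPARK_PYTHON": "/usr/bin/python",
"SPARK_HOME": "/opt/cloudera/parcels/SPARK2/lib/spark2",
"HADOOP_CONF_DIR": "/etc/spark2/conf/yarn-conf",
"HIVE_AUX_JARS_PATH": "/usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar",
"HADOOP_CLIENT_OPTS": "-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true",
"PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.3-src.zip:/opt/cloudera/parcels/SPARK2/lib/spark2/python/",
"PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"
}
=========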