
Using Hive from Spark 2.0.0 in JupyterHub, CDH 5.10.0.1


Hi All,

 

I am having trouble using Hive from Spark 2.0 on JupyterHub.

 

It works in pyspark2 or spark2-submit but not in JupyterHub.

 

It works in JupyterHub with the older Spark 1.6 (and I do not recall having to do anything special to make it work).

 

Apparently some environment variable is missing from the JupyterHub Spark 2.0 kernel:

=========

$ cat /usr/local/share/jupyter/kernels/pyspark2/kernel.json
{
  "display_name": "pySpark (Spark 2.0.0)",
  "language": "python",
  "argv": [
    "/usr/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "PYSPARK_PYTHON": "/usr/bin/python",
    "SPARK_HOME": "/opt/cloudera/parcels/SPARK2/lib/spark2",
    "HADOOP_CONF_DIR": "/etc/hadoop/conf",
    "PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.3-src.zip:/opt/cloudera/parcels/SPARK2/lib/spark2/python/",
    "PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"
  }
}

=========
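
To confirm which of these variables the kernel actually exports, I can print them from a notebook cell (a quick diagnostic; the names are the ones from the kernel.json above):

=========

import os

# Print the Spark-related variables this kernel actually sees;
# the names come from the "env" section of kernel.json above.
for var in ("SPARK_HOME", "HADOOP_CONF_DIR", "PYTHONPATH", "PYSPARK_SUBMIT_ARGS"):
    print(var, "=", os.environ.get(var, "<not set>"))

=========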

Is there a way to set the path to the metastore manually from inside the shell? Which parameter controls it? I think it is looking in $HOME.
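
Something like the sketch below is what I have in mind (hypothetical; the thrift host and port are placeholders for whatever the cluster uses, and I have not confirmed these are the right settings):

=========

from pyspark.sql import SparkSession

# Hypothetical sketch: point Spark 2.0 at the Hive metastore explicitly
# instead of relying on hive-site.xml being found via the conf directory.
# The thrift URI below is a placeholder, not a real host.
spark = SparkSession.builder \
    .appName("hive-from-jupyterhub") \
    .config("hive.metastore.uris", "thrift://metastore-host:9083") \
    .enableHiveSupport() \
    .getOrCreate()

# Note: if the PYTHONSTARTUP shell has already created a SparkSession,
# getOrCreate() returns that one and the config above may not take effect.
spark.sql("show tables").show()

=========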

 

The command that I am trying to execute is:

=========

sqlCtx.sql("show tables").show()

=========

It returns the expected list of tables when run in the pyspark2 shell, but returns an empty list in JupyterHub.
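
One comparison I can run in both environments (a diagnostic sketch; in Spark 2.0 the shell exposes the spark session alongside sqlCtx):

=========

# Compare these values between the pyspark2 shell and JupyterHub;
# a differing warehouse directory would explain the empty table list.
print(spark.conf.get("spark.sql.warehouse.dir"))
print(spark.catalog.listTables())

=========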

 

Thank you,

Igor

 

 

2 Replies

Re: Using Hive from Spark 2.0.0 in JupyterHub, CDH 5.10.0.1

I solved this problem. I looked at the environment set in /etc/spark2/conf/yarn-conf/hive-env.sh and set the corresponding variables in the JupyterHub kernel's kernel.json. In particular:

 

 "HADOOP_CONF_DIR":"/etc/spark2/conf/yarn-conf",
 "HIVE_AUX_JARS_PATH":"/usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar",
 "HADOOP_CLIENT_OPTS":"-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true",

I think HADOOP_CONF_DIR is the most important one, because previously it pointed to a directory that does not contain hive-site.xml.
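
To verify the fix, one quick check from a notebook cell is whether the conf directory the kernel sees actually contains hive-site.xml (without it, Spark silently falls back to a local, empty metastore, which matches the empty table list above):

=========

import os

# Check that HADOOP_CONF_DIR, as seen by this kernel, contains
# hive-site.xml; if not, Spark falls back to a local metastore.
conf_dir = os.environ.get("HADOOP_CONF_DIR", "")
hive_site = os.path.join(conf_dir, "hive-site.xml")
print(hive_site, "exists:", os.path.exists(hive_site))

=========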

Re: Using Hive from Spark 2.0.0 in JupyterHub, CDH 5.10.0.1

Hi Igor,

Thanks for your post!

It helped me set up a pyspark kernel for remote cluster connectivity; our remote cluster is running CDH 5.14.4.

Ping