Pyspark cannot reach hive

Expert Contributor

In short: I have a working Hive on HDP 3 that I cannot reach from PySpark running under YARN (on the same HDP cluster). How do I get PySpark to find my tables?

spark.catalog.listDatabases() only shows `default`, and no query I run shows up in my Hive logs.

This is my code, with Spark 2.3.1:

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
settings = []
conf = SparkConf().setAppName("Guillaume is here").setAll(settings)
spark = (
    SparkSession
    .builder
    .master('yarn')
    .config(conf=conf)
    .enableHiveSupport()
    .getOrCreate()
)
print(spark.catalog.listDatabases())

Note that `settings` is empty. I thought this would be sufficient, because in the logs I see

loading hive config file: file:/etc/spark2/3.0.1.0-187/0/hive-site.xml

and more interestingly

Registering function intersectgroups io.x.x.IntersectGroups

This is a UDF I wrote and added to Hive manually, so some sort of connection to the metastore is clearly being made. The only output I get (apart from the logs) is:

[ Database(name=u'default', description=u'default database', locationUri=u'hdfs://HdfsNameService/apps/spark/warehouse')]

I understand that I should set `spark.sql.warehouse.dir` in `settings`. But whether I set it to the value I find in hive-site.xml, to the path of the database I am interested in (it is not in the default location), or to that path's parent, nothing changes.

I put many other config options in `settings` (including the thrift URIs); no change.
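To be concrete, this is the kind of thing I mean by putting options in `settings` (the property names are standard Spark/Hive ones; the values here are illustrative placeholders, not my cluster's real paths or hosts):

```python
# Illustrative placeholders only -- not my actual cluster values.
settings = [
    # tried the hive-site.xml value, the database's own path, and its parent
    ("spark.sql.warehouse.dir", "/apps/hive/warehouse"),
    # metastore thrift endpoint
    ("hive.metastore.uris", "thrift://metastore.example.com:9083"),
]

# This list is then passed to SparkConf exactly as in the code above:
#   conf = SparkConf().setAppName("...").setAll(settings)
```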

I have also seen that I should copy hive-site.xml into the spark2 conf dir. I did that on all nodes of my cluster; no change.

The command I run is:

HDP_VERSION=3.0.1.0-187 PYTHONPATH=.:/usr/hdp/current/spark2-client/python/:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip SPARK_HOME=/usr/hdp/current/spark2-client HADOOP_USER_NAME=hive spark-submit --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip --files /etc/hive/conf/hive-site.xml ./subjanal/anal.py
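If the underlying issue is that on HDP 3 Spark and Hive keep separate catalogs, then the Hive Warehouse Connector already on the `--jars`/`--py-files` lines above may be the intended access path rather than `spark.catalog`. A sketch of how I understand its use, assuming the `pyspark_llap` API shipped with the connector and a `spark.sql.hive.hiveserver2.jdbc.url` config pointing at HiveServer2 (`my_db` and `my_table` are hypothetical names):

```python
# Hedged sketch, assuming the HiveWarehouseSession API from the pyspark_llap
# package that ships with the HDP 3 Hive Warehouse Connector. This only runs
# on a cluster where the connector jar/zip are available and
# spark.sql.hive.hiveserver2.jdbc.url is configured.
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()      # should list the Hive databases, not just Spark's
hive.setDatabase("my_db")        # hypothetical database name
hive.executeQuery("SELECT * FROM my_table LIMIT 10").show()  # hypothetical table
```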