
Unable to find Hive database from Pyspark scripts


New Contributor

I am using the HDP 2.6.5 Sandbox on Docker/Windows 10.

Using Hive View 2.0, I created a database and table; let's call them "sampledb" with a table called "sampletable" in HDFS location /sampledb/sampletable/. I stored data in the form of an ORC file in the appropriate directory and invoked `msck repair table sampledb.sampletable`. From both Ambari Hive View and Hive View 2.0 I am able to successfully read data from sampletable.

In Zeppelin I wrote a Pyspark script that uses the Spark SQL interface to read data from sampletable. It works, no problem. Note, this is Spark 2, not Spark 1.6.

We do not want to run Zeppelin scripts in production, so I converted the script to standard Python for use with spark-submit. The Spark initialization code is below:

settings = [
    ("hive.exec.dynamic.partition", "true"),
    ("hive.exec.dynamic.partition.mode", "nonstrict"),
    ("spark.sql.orc.filterPushdown", "true"),
    ("hive.msck.path.validation", "ignore"),
    ("spark.sql.caseSensitive", "true"),
    ("spark.speculation", "false"),
    ] 
spark_conf = SparkConf().setAppName("sampleApp").setAll(settings)
self.spark = SparkSession.builder. \
    enableHiveSupport(). \
    config(conf = spark_conf). \
    getOrCreate()

Python dependencies are zipped into spark.zip, and the script is called sample.py. After copying the script to my /home/ dir, I attempt to run it as follows:

# spark-submit --py-files ./spark.zip sample.py

The relevant error output, so far as I can tell, is the following console output:

SharedState:54 - Warehouse path is 'file:/home/raj_ops/spark-warehouse'.
...
HiveClientImpl:54 - Warehouse location for Hive client (version 1.2.2) is file:/home/raj_ops/spark-warehouse
...
Failed to get database sampledb, returning NoSuchObjectException                                              
u"Table or view not found: `sampledb`.`sampletable`

The only thing I had to show for my effort was a new /spark-warehouse sub-directory within /home/raj_ops/. So I tried adding the following line to the Python script:

("spark.sql.warehouse.dir","/apps/hive/warehouse")

So now the full configuration is as follows:

settings = [
    ("hive.exec.dynamic.partition", "true"),
    ("hive.exec.dynamic.partition.mode", "nonstrict"),
    ("spark.sql.orc.filterPushdown", "true"),
    ("hive.msck.path.validation", "ignore"),
    ("spark.sql.caseSensitive", "true"),
    ("spark.speculation", "false"),
    ("spark.sql.warehouse.dir","/apps/hive/warehouse")
    ] 
spark_conf = SparkConf().setAppName("sampleApp").setAll(settings)

The invocation of spark-submit was the same. The outcome was similarly unsuccessful.

What else do I need to do to get the queries from Pyspark scripts to access the Hive database and tables successfully?

Thanks in advance!

Chris Falter

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Unable to find Hive database from Pyspark scripts

New Contributor

I was able to resolve the problem by manually editing my Python script to incorporate the settings from /etc/hive/conf/hive-site.xml. What worked is the following code:

        settings = [
                ("hive.exec.dynamic.partition", "true"),
                ("hive.exec.dynamic.partition.mode", "nonstrict"),
                ("spark.sql.orc.filterPushdown", "true"),
                ("hive.msck.path.validation", "ignore"),
                ("spark.sql.caseSensitive", "true"),
                ("spark.speculation", "false"),
                ("hive.metastore.authorization.storage.checks", "false"),
                ("hive.metastore.cache.pinobjtypes", "Table,Database,Type,FieldSchema,Order"),
                ("hive.metastore.client.connect.retry.delay", "5s"),
                ("hive.metastore.client.socket.timeout", "1800s"),
                ("hive.metastore.connect.retries", "12"),
                ("hive.metastore.execute.setugi", "false"),
                ("hive.metastore.failure.retries", "12"),
                ("hive.metastore.kerberos.keytab.file", "/etc/security/keytabs/hive.service.keytab"),
                ("hive.metastore.kerberos.principal", "hive/_HOST@EXAMPLE.COM"),
                ("hive.metastore.pre.event.listeners", "org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener"),
                ("hive.metastore.sasl.enabled", "false"),
                ("hive.metastore.schema.verification", "false"),
                ("hive.metastore.schema.verification.record.version", "false"),
                ("hive.metastore.server.max.threads", "100000"),
                ("hive.metastore.uris", "thrift://sandbox-hdp.hortonworks.com:9083"),
                ("hive.metastore.warehouse.dir", "/apps/hive/warehouse")
                ] 
        spark_conf = SparkConf().setAppName("sampleApp").setAll(settings)
        self.spark = SparkSession.builder. \
            config(conf = spark_conf). \
            enableHiveSupport(). \
            getOrCreate()

What did *not* work was copying /etc/hive/conf/hive-site.xml to /etc/spark2/conf/hive-site.xml. I don't know why this didn't work, but it didn't.
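Rather than copying each property by hand, the relevant pairs can also be pulled out of hive-site.xml programmatically. Here is a minimal sketch using only the Python standard library; the helper name and the default path are just illustrative, not something from my actual script:

```python
import xml.etree.ElementTree as ET

def hive_site_settings(path="/etc/hive/conf/hive-site.xml"):
    """Parse a Hadoop-style configuration file and return its (name, value)
    pairs as a list of tuples suitable for SparkConf.setAll()."""
    root = ET.parse(path).getroot()
    settings = []
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            settings.append((name, value))
    return settings

# The parsed pairs could then be appended to the Spark-specific settings:
# spark_conf = SparkConf().setAppName("sampleApp") \
#                         .setAll(spark_settings + hive_site_settings())
```

This keeps the script in sync with the cluster's actual Hive configuration instead of a hard-coded copy.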

I hope this resolution helps someone else!

Chris Falter

4 REPLIES 4

Re: Unable to find Hive database from Pyspark scripts

@Christopher Falter

Which master are you running the job with?

Make sure hive-site.xml is present in /etc/spark/conf/, and if you are running in yarn-cluster mode, pass hive-site.xml to spark-submit using the --files parameter.

Re: Unable to find Hive database from Pyspark scripts

New Contributor

Hi Sandeep,

I appreciate your response. I followed both of your suggestions:

# cp /etc/hive/conf/hive-site.xml /etc/spark/conf/

This did not result in a change in behavior.

# spark-submit --py-files ./spark.zip --files /etc/spark/conf/hive-site.xml sample.py

No change there, either.

I also tried creating sampletable in the default database instead. Same result, different error message:

'UnresolvedRelation `default`.`sampletable`

I suspect that the submitted job is looking at a different Hive metastore than the one I am viewing in Ambari > Hive 2.0 View, even though I am seeing output that would indicate, IMO, that the Spark job is using the correct Hive metastore:

SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('/apps/hive/warehouse').                                                 
2018-09-19 15:29:59 INFO  SharedState:54 - Warehouse path is '/apps/hive/warehouse'
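To make my suspicion concrete: my understanding is that spark.sql.warehouse.dir only tells Spark where an embedded, local metastore keeps its files, while hive.metastore.uris is what actually points the session at an external metastore service. A throwaway sketch of that logic in plain Python (the helper function is hypothetical, purely illustrative):

```python
def metastore_target(settings):
    """Hypothetical helper, purely illustrative: given Spark/Hive settings as a
    dict, report which metastore a Hive-enabled SparkSession would end up using.
    Without hive.metastore.uris, Spark falls back to a local embedded metastore,
    and spark.sql.warehouse.dir (default: a 'spark-warehouse' directory in the
    working directory) is merely where that local store keeps table data."""
    uris = settings.get("hive.metastore.uris", "")
    if uris.startswith("thrift://"):
        return ("remote", uris)
    return ("embedded", settings.get("spark.sql.warehouse.dir", "spark-warehouse"))

# With only a warehouse dir set, the session is still on an embedded metastore,
# which would explain the empty spark-warehouse directory appearing under /home:
print(metastore_target({"spark.sql.warehouse.dir": "/apps/hive/warehouse"}))
# → ('embedded', '/apps/hive/warehouse')
```

If that is right, the log line above shows the warehouse path being set, but says nothing about which metastore service the job is connected to.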

I am using the default master, which is 'local[*]'.

Any other ideas?


Re: Unable to find Hive database from Pyspark scripts

@Christopher Falter

Sorry, I didn't get a chance to test this. Usually, with /etc/spark2/conf/hive-site.xml in place, Spark should be able to connect to Hive.