How to connect to Hive using Spark (pyspark)?

Explorer

Hello everyone, I have a problem. I'm trying to work with Hive datasets using pyspark. I have 3 databases, but I only get the default database. It looks like Spark creates a new warehouse in the same directory as the Python program. Here's my program:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .getOrCreate()

spark.sql("show databases").show()

And here's the output:

oout.PNG

And here's the Hive output:

out2.PNG

I want to connect to the Hive databases. Thanks in advance.

 

2 REPLIES

Master Mentor

@totti1 
This is all about the HMS (Hive Metastore) and metadata refreshing. Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, the metadata of those converted tables is also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.

 

from os.path import expanduser, join
from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'spark-warehouse'

spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.enableHiveSupport() \
.getOrCreate()

# spark is an existing SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS totti (key INT, value STRING)")

# Load some data here
spark.sql("LOAD DATA LOCAL INPATH 'path/to/the/table/totti.txt' INTO TABLE totti")

# Refresh Spark's cached metadata for the table
# (spark is an existing SparkSession)
spark.catalog.refreshTable("totti")

# Queries are expressed in HiveQL
spark.sql("SELECT * FROM totti").show()

 

In the above example, you will need to connect to the database in order to create the table totti. Notice that I run the refresh before the SELECT so that the cached metadata is invalidated and fetched again from the databases; otherwise I may get errors such as "table not found".
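For the connection part, here is a minimal sketch of pointing the session at an existing Hive metastore instead of a local one. It assumes the metastore is reachable over Thrift and that its URI is known. The helper names (hive_conf, build_hive_session, catalog_implementation) and any URI or path passed to them are hypothetical and do not come from this thread. Placing the cluster's hive-site.xml in Spark's conf directory is another common way to get the same effect.

```python
# Sketch only: URIs and paths passed to these helpers are placeholders.

def hive_conf(metastore_uri, warehouse_dir=None):
    """Collect the settings that point Spark at an existing Hive metastore."""
    conf = {"hive.metastore.uris": metastore_uri}
    if warehouse_dir is not None:
        # Default location for managed databases and tables
        conf["spark.sql.warehouse.dir"] = warehouse_dir
    return conf


def build_hive_session(app_name, conf):
    """Create a Hive-enabled SparkSession from a dict of settings."""
    # Imported here so hive_conf() can be used without Spark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def catalog_implementation(spark):
    """Return 'hive' when the session uses the Hive catalog, 'in-memory' otherwise."""
    return spark.conf.get("spark.sql.catalogImplementation")
```

A possible use would be build_hive_session("hive-example", hive_conf("thrift://<metastore-host>:9083")), with the host replaced by your own and 9083 being only the commonly used metastore port, followed by spark.sql("show databases").show(). If catalog_implementation(spark) returns "in-memory", the session is not using the Hive metastore at all. Note that getOrCreate() returns an already running session if there is one, so these settings need to be applied before the first session is created.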

 

Explorer

Thank you for your reply.

I don't want to use the Spark warehouse. I want to use the Hive warehouse, the global Hive one.