
Query Hive tables with Spark SQL


Hi

I have a single-node Hadoop cluster with Hive installed, and I have one Hive database with some Hive tables stored in HDFS.

Now I want to run some SQL queries on those Hive tables using Spark SQL.

Has someone already done this? What is the process to achieve it? Do we need to create the tables again in Spark, or can we access the Hive tables directly with Spark SQL? I'm trying to find an article about this, but it always seems that we need to create the tables again with Spark SQL and load the data into them again. I don't understand why, if we already have all of this in Hive!

5 REPLIES

Master Mentor

Rising Star

Hi Artem,

I'm currently stuck on a particular use case where I'm trying to access Hive table data using spark.read.jdbc, as shown below:

export SPARK_MAJOR_VERSION=2

spark-shell

import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val connectionProperties = new java.util.Properties()

val hiveQuery = "(SELECT * from hive_table limit 10) tmp"

val hiveResult = spark.read.jdbc("jdbc:hive2://hiveServerHostname:10000/hiveDBName;user=hive;password=hive", hiveQuery, connectionProperties).collect()

But when I check the results in hiveResult, it's just empty. Could you please suggest what's going on here?

I know we can access Hive tables using a HiveSession, and I've successfully tried that, but is it possible to run Hive queries and access Hive data using the above method?
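
For comparison, here is a minimal sketch of the metastore-based access that does work for me, assuming the spark-shell session has Hive support enabled (the default on HDP); hive_table is the same example table name as in the query above:

// Read the existing Hive table through the metastore rather than over JDBC
// ("hive_table" is just the example table name used above)
val metastoreResult = spark.table("hive_table").limit(10).collect()
metastoreResult.foreach(println)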

avatar

@Jan J

First you need to create hive-site.xml in the Spark conf directory:

As user root, create the file SPARK_HOME/conf/hive-site.xml. Edit the file to contain only the following configuration setting:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!--Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://sandbox.hortonworks.com:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>

Set SPARK_HOME

If you haven’t already, make sure to set SPARK_HOME before proceeding:

export SPARK_HOME=/usr/hdp/current/spark-client

Create Hive Context

At a scala> REPL prompt, type the following:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

Create Hive Table

hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (key INT, value STRING)")

Load example key-value data into the table

hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE TestTable")

scala> hiveContext.sql("SHOW TABLES").collect().foreach(println)

scala> hiveContext.sql("DESCRIBE raw_users").collect().foreach(println)
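
To come back to the original question: once the HiveContext is pointed at the metastore, you can also query tables that already exist in Hive without recreating or reloading them. A minimal sketch (my_existing_table is a hypothetical name, substitute one of your own Hive tables):

// Query a table that already exists in the Hive metastore; no CREATE or LOAD needed
// ("my_existing_table" is a hypothetical table name)
val existingDf = hiveContext.sql("SELECT * FROM my_existing_table LIMIT 10")
existingDf.show()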

avatar
Guru

+1 azeltov. We can take that one step further and make the Hive tables available through Spark SQL via JDBC from outside the cluster.

SPARK_HOME/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port={port to listen} --hiveconf hive.server2.thrift.bind.host={host to bind to} --master yarn-client

This will start a HiveServer2 instance that has access to the metastore but turns SQL into Spark instruction sets and RDDs under the covers. You should now be able to use a HiveServer2-compliant JDBC driver to connect, get the power of Spark SQL, and still leverage all of the existing investment and assets in Hive.
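
For example, a minimal client-side sketch using plain JDBC, assuming the Hive JDBC driver is on the classpath and the Thrift server was started on a hypothetical host sparkhost and port 10015 (use whatever values you passed to start-thriftserver.sh):

import java.sql.DriverManager

// Connect to the Spark Thrift Server exactly as you would to a regular HiveServer2 instance
// ("sparkhost" and 10015 are placeholders for the host/port given to start-thriftserver.sh)
val conn = DriverManager.getConnection("jdbc:hive2://sparkhost:10015/default", "hive", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT * FROM TestTable LIMIT 10")
while (rs.next()) {
  println(rs.getInt(1) + "\t" + rs.getString(2))
}
rs.close(); stmt.close(); conn.close()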

Explorer

Hi @Jan J

If you already have a cluster with Hive tables in it, you don't need to create those tables with Spark once more.

You can just connect to the existing ones. Please try the following (a short end-to-end sketch follows these steps):

1. Pack your code in a jar file and move it somewhere onto your cluster. Make Hive query calls with SparkSession.sql("YOUR_QUERY").

2. Run the spark-submit tool with 'driver-java-options' pointing at your local metastore:

--driver-java-options "-Dhive.metastore.uris=thrift://localhost:9083"
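
A minimal sketch of the code that would go into that jar, assuming Spark 2.x (the application name and your_hive_table are placeholders):

import org.apache.spark.sql.SparkSession

// Build a session with Hive support so spark.sql() can see the existing Hive metastore
// (the metastore URI comes from hive-site.xml or the -Dhive.metastore.uris option above)
val spark = SparkSession.builder()
  .appName("QueryExistingHiveTables")   // hypothetical application name
  .enableHiveSupport()
  .getOrCreate()

// Query a table that already exists in Hive; no need to recreate or reload it
spark.sql("SELECT * FROM your_hive_table LIMIT 10").show()

spark.stop()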

Best regards,

Olga