
Error while reading data from Hive JDBC in Spark2 from a remote cluster

Hi,

I'm trying to read a Hive table using the Hive JDBC URL in Spark. Given below are the steps:

export SPARK_MAJOR_VERSION=2

spark-shell

import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val connectionProperties = new java.util.Properties()

val hiveQuery = "(SELECT * from hive_table limit 10) tmp"

val hiveResult = spark.read.jdbc("jdbc:hive2://hiveServerHostname:10000/hiveDBName;user=hive;password=hive", hiveQuery, connectionProperties).collect()

But when I check hiveResult, it is empty, even though the table returns records when queried directly in Hive.

Hive version: 1.2.10000, Spark version: 2.0

Any suggestions on why the data is not being read from the Hive table?

I'm trying to read the data from a remote Hive cluster over Hive JDBC. Also, when I check the logs, the query I used in the code:

(SELECT MIN(my_key) as lower, MAX(my_key) as upper FROM my_table) tmp

gets converted to something like the following:

SELECT "lower","upper" FROM (SELECT MIN(my_key) as lower, MAX(my_key) as upper FROM my_table) tmp

Any ideas why this happens and whether it can be prevented, i.e. whether the above query can be issued without the double quotes through JDBC?
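
For context, Spark's JDBC data source quotes column identifiers through its JdbcDialect API, and the default dialect wraps them in double quotes, which HiveQL parses as string literals rather than column names. Below is a minimal, untested sketch of a custom dialect that quotes with backticks instead, assuming Spark 2.x's org.apache.spark.sql.jdbc API (the jdbc:hive2 URL check is my assumption):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Sketch: for jdbc:hive2 URLs, quote identifiers with backticks
// (HiveQL style) instead of Spark's default double quotes, so that
// SELECT "lower","upper" becomes SELECT `lower`,`upper`.
val hiveDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

// Register the dialect before calling spark.read.jdbc(...)
JdbcDialects.registerDialect(hiveDialect)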

@aervits

3 Replies

Rising Star

@Shikhar Agarwal

Have you tried reading the Hive table using SparkSession instead? The following code works for me using Spark 2 on HDP 2.5.3:

/usr/hdp/current/spark2-client/bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

val hiveSession = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate()

// Query the Hive table through the Hive-enabled session
val hiveResult = hiveSession.sql("SELECT * FROM timesheet LIMIT 10")
hiveResult.show()

Details for this can be found in the Apache Spark docs:

http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Is it even possible to read the data from Hive using spark.read.jdbc?

New Contributor

@Shikhar Agarwal

Please set the fetchsize property; it should solve the issue of no data being returned.

fetchsize - The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to a low fetch size (e.g., Oracle with 10 rows). This option applies only to reading.

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://34.223.237.55:10000")
  .option("dbtable", "students1")
  .option("user", "hduser")
  .option("password", "hadoop")
  .option("fetchsize", "20")
  .load()