Error while reading data from Hive JDBC in Spark2 from a remote cluster

New Contributor

Hi,

I'm trying to read a Hive table using the Hive JDBC URL in Spark. The steps are given below:

export SPARK_MAJOR_VERSION=2

spark-shell

import org.apache.spark.sql.{DataFrame, Row,SparkSession}

val connectionProperties = new java.util.Properties()

val hiveQuery = "(SELECT * from hive_table limit 10) tmp"

val hiveResult = spark.read.jdbc("jdbc:hive2://hiveServerHostname:10000/hiveDBName;user=hive;password=hive", hiveQuery, connectionProperties).collect()

But when I check hiveResult, it comes back empty, even though the table returns records when queried directly in Hive.

Hive Version - 1.2.10000 and Spark 2.0

Any suggestions on why the data is not being read from the Hive table?

I'm reading the data from a remote Hive cluster over Hive JDBC. Also, when I check the logs, a query I used in the code, such as:

(SELECT MIN(my_key) as lower, MAX(my_key) as upper FROM my_table) tmp

gets rewritten to something like:

SELECT "lower","upper" FROM (SELECT MIN(my_key) as lower, MAX(my_key) as upper FROM my_table) tmp

Any ideas on why this happens, and whether it can be prevented, i.e. whether the above query can be issued through JDBC without the double quotes?
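For context on the double quotes: Spark's JDBC data source quotes column names using the registered JdbcDialect, and the default dialect uses double quotes, which HiveQL treats as string literals rather than identifiers. A hedged sketch of one possible workaround (untested here; the URL-prefix check and object name are illustrative, not an official Hive dialect) is to register a custom dialect that quotes identifiers with backticks before calling spark.read.jdbc:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Sketch: a dialect for jdbc:hive2 URLs that quotes identifiers
// with backticks (HiveQL style) instead of the default double quotes.
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2")

  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

// Register the dialect before issuing spark.read.jdbc against Hive.
JdbcDialects.registerDialect(HiveDialect)
```

With this in place, the generated outer query should use backticks (SELECT `lower`,`upper` FROM ...), which Hive can parse as column references.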

@aervits


Re: Error while reading data from Hive JDBC in Spark2 from a remote cluster

Rising Star

@Shikhar Agarwal

Have you tried reading the Hive table using SparkSession instead? The following code works for me using Spark 2 on HDP 2.5.3:

/usr/hdp/current/spark2-client/bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

val hiveSession = SparkSession
.builder()
.appName("Spark Hive Example")
.enableHiveSupport()
.getOrCreate()

val hiveResult = hiveSession.sql("SELECT * FROM timesheet LIMIT 10")
hiveResult.show()

Details for this can be found in the Apache Spark docs:

http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Re: Error while reading data from Hive JDBC in Spark2 from a remote cluster

New Contributor

Is it even possible to read the data from Hive using spark.read.jdbc?

Re: Error while reading data from Hive JDBC in Spark2 from a remote cluster

New Contributor

@Shikhar Agarwal

Please set the fetchsize option; it should resolve the issue of no data being returned.

fetchsize - the JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to a low fetch size (e.g. Oracle with 10 rows). This option applies only to reading.

val jdbcDF = spark.read.
  format("jdbc").
  option("url", "jdbc:hive2://34.223.237.55:10000").
  option("dbtable", "students1").
  option("user", "hduser").
  option("password", "hadoop").
  option("fetchsize", "20").
  load()
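Applied to the original question, a hedged sketch of the same read using the subquery style from this thread (the host, database, table, and credentials are the placeholders used above, not real values) might look like:

```scala
// Sketch: the original subquery read, with fetchsize set.
// Hostname, database, table, and credentials are placeholders from this thread.
val hiveQuery = "(SELECT * FROM hive_table LIMIT 10) tmp"

val df = spark.read.
  format("jdbc").
  option("url", "jdbc:hive2://hiveServerHostname:10000/hiveDBName").
  option("dbtable", hiveQuery).
  option("user", "hive").
  option("password", "hive").
  option("fetchsize", "20").
  load()

df.show()
```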