
Spark with HIVE JDBC connection

Contributor

Hi All,

I have a sample table (students1) in Hive which I want to connect to from Spark using JDBC (since Hive is not in the same cluster). I was trying with the following code...

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  //Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance()
  val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .getOrCreate()

  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://34.223.237.55:10000")
    .option("dbtable", "students1")
    .option("user", "hduser")
    .option("password", "hadoop")
    //.option("driver", "org.apache.hadoop.hive.jdbc.HiveDriver")
    .load()

  println("able to connect------------------")

  jdbcDF.show()
  jdbcDF.printSchema()

  jdbcDF.createOrReplaceTempView("std")

  val sqlDF = spark.sql("select * from std")
  println("Start println-----")
  spark.sqlContext.sql("select * from std").collect().foreach(println)
  println("end println-----")
  sqlDF.show(false)
}

I tried multiple ways, but every time it shows only the table structure with column names, like this:

+--------------+-------------+-------------+
|students1.name|students1.age|students1.gpa|
+--------------+-------------+-------------+
+--------------+-------------+-------------+

But no data. I am able to get data from my local machine with DBeaver using a SQL query, and from Spark, jdbcDF.printSchema() also shows the proper schema, so I guess there is no issue with the connection.

I am using Spark 2.1.1 with Hive 1.2.1. My build.sbt file looks like this:

libraryDependencies ++= Seq(
  "log4j"            % "log4j"                                  % "1.2.17",
  "org.apache.spark" % "spark-core_2.11"                        % "2.1.1",
  "org.apache.spark" % "spark-streaming-kafka_2.10"             % "1.6.2",
  "org.apache.spark" % "spark-hivecontext-compatibility_2.10"   % "2.0.0-preview",
  "org.apache.spark" % "spark-sql_2.11"                         % "2.1.1",
  "org.apache.spark" % "spark-hive_2.10"                        % "2.1.1",
  "org.apache.hive"  % "hive-jdbc"                              % "1.2.1"
)

Can anyone suggest why I am not getting any output from show()? Thanks in advance.

1 ACCEPTED SOLUTION

15 REPLIES


Contributor

Thanks Jordan... you saved my day. Thanks a lot! 🙂

Explorer

What about reading external Hive data via JDBC from Spark SQL?

Super Collaborator

@CaselChen Again, Spark connects directly to the Hive metastore; using JDBC requires you to go through HiveServer2.
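
For comparison, the metastore route looks roughly like this (a minimal sketch; it assumes hive-site.xml is on Spark's classpath, and reuses the table name from the original question):

import org.apache.spark.sql.SparkSession

// Connects through the Hive metastore rather than JDBC/HiveServer2.
val spark = SparkSession.builder()
  .appName("Hive via metastore")
  .enableHiveSupport() // requires a Spark build with Hive support
  .getOrCreate()

spark.table("default.students1").show()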

Explorer

Hi Jordan, is there any way, using JDBC through HiveServer2 with Spark, to load Hive data for non-transactional as well as transactional tables? I have tried it, but my tables were empty after loading. All the table structures returned (column names) are fine, though. Thanks.

Super Collaborator

@Tu Nguyen I'm afraid I don't understand your question. Spark does not use JDBC to communicate with Hive, but it can load Hive with any type of data that can be represented as a Spark Dataset.

You may want to try a "MSCK REPAIR TABLE <tablename>;" in Hive, though.
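
For context, that statement rebuilds partition metadata for a partitioned table whose files were added outside of Hive. With Hive support enabled, it can also be issued from Spark; a minimal sketch, using the table name from later in this thread:

// Recovers partitions that exist in the table's directory but are
// missing from the metastore (no effect on unpartitioned tables).
spark.sql("MSCK REPAIR TABLE tnguy.table_transactional_test")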

Explorer

Sorry Jordan, I was not clear. Since SparkContext does not support Hive transactional tables, I am trying to use SparkSession to load a Hive transactional table through JDBC, as below:

spSession.read
  .format("jdbc")
  .option("url", url)
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "tnguy.table_transactional_test")
  .load()
  .show()

The result is an empty table, but I was expecting 3 rows of data.

I also tried it on a non-transactional table, but it gave the same result. I am wondering if we can use Hive JDBC to load Hive data at all?

Thanks,

Super Collaborator

@Tu Nguyen Where are you reading that you need to use JDBC from Spark to communicate with Hive? It isn't in the Spark SQL documentation.

https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

1. Try using an alternative JDBC client and see if you get similar results.

2. What happens when you simply use the following?

// warehouseLocation should point at the Hive warehouse directory, e.g.:
val warehouseLocation = "/apps/hive/warehouse"

val spark = SparkSession
  .builder()
  .appName("Spark Transactional Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

spark.table("tnguy.table_transactional_test").count()

Explorer

I guess Tu Nguyen wants to load an external Hive table into Spark, right? If so, consider the following code:

import org.apache.spark.sql.SparkSession

object SparkHiveJdbc extends App {
  val spark = SparkSession.builder.master("local[2]").appName("SparkHiveJob").getOrCreate
  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  val driverName = "org.apache.hive.jdbc.HiveDriver"
  Class.forName(driverName)
  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://localhost:10000/default")
    .option("dbtable", "clicks_json")
    .load()
  df.printSchema()
  println(df.count())
  df.show()
}

I ran the above code and got the following output and error:

root
 |-- clicks_json.ip: string (nullable = true)
 |-- clicks_json.timestamp: long (nullable = true)
 |-- clicks_json.url: string (nullable = true)
 |-- clicks_json.referrer: string (nullable = true)
 |-- clicks_json.useragent: string (nullable = true)
 |-- clicks_json.sessionid: integer (nullable = true)
 |-- clicks_json.tenantid: string (nullable = true)
 |-- clicks_json.datestamp: string (nullable = true)
998
Caused by: java.lang.NumberFormatException: For input string: "clicks_json.timestamp"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at java.lang.Long.parseLong(Long.java:631)
  at org.apache.hive.jdbc.HiveBaseResultSet.getLong(HiveBaseResultSet.java:368)
  ... 23 more

The reason for the error, I think, is that Spark loads the header (title) row as the first data row when converting the ResultSet into internal Row objects.
Is anything wrong here?
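
For what it's worth, a commonly reported explanation for exactly this symptom (empty results, or column names echoed back as values) is that Spark's generic JDBC source quotes identifiers with double quotes, which HiveServer2 treats as string literals; the query then returns the literal string "clicks_json.timestamp" for every row, and parsing it as a long fails. A sketch of the usual workaround (untested here) is registering a custom JdbcDialect that quotes identifiers with backticks before calling spark.read:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Sketch of a Hive dialect: quote identifiers with backticks so Hive does
// not treat "col" as a string literal and echo it back for every row.
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String =
    colName.split('.').map(part => s"`$part`").mkString(".")
}

// Register once, before any spark.read.format("jdbc") call against Hive.
JdbcDialects.registerDialect(HiveDialect)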