<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark with HIVE JDBC connection in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233232#M195060</link>
    <description>&lt;P&gt;It won't work on a transactional table.&lt;/P&gt;</description>
    <pubDate>Wed, 17 Jan 2018 03:56:55 GMT</pubDate>
    <dc:creator>thanhtu3009</dc:creator>
    <dc:date>2018-01-17T03:56:55Z</dc:date>
    <item>
      <title>Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233221#M195049</link>
      <description>&lt;P&gt;Hi All, &lt;/P&gt;&lt;P&gt;I have a sample table (students1) in Hive which I want to connect to from Spark using JDBC (as Hive is not in the same cluster). I was trying the following code...&lt;/P&gt;&lt;PRE&gt;def main(args: Array[String]): Unit = {
//Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance()
val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]")

val sc = new SparkContext(conf)

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .getOrCreate()

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://34.223.237.55:10000")
  .option("dbtable", "students1")
  .option("user", "hduser")
  .option("password", "hadoop")
  //.option("driver", "org.apache.hadoop.hive.jdbc.HiveDriver")
  .load()

println("able to connect------------------")

jdbcDF.show

jdbcDF.printSchema()

jdbcDF.createOrReplaceTempView("std")

val sqlDF = spark.sql("select * from std")
println("Start println-----")
spark.sqlContext.sql("select * from std").collect().foreach(println)
println("end println-----")
sqlDF.show(false)
}&lt;/PRE&gt;&lt;P&gt;I tried multiple approaches, but each time it only shows the table structure with the column names, like...&lt;/P&gt;&lt;PRE&gt;+--------------+-------------+-------------+
|students1.name|students1.age|students1.gpa|
+--------------+-------------+-------------+
+--------------+-------------+-------------+&lt;/PRE&gt;&lt;P&gt;No data is returned, but I am able to get data when querying with DBeaver from my local machine using SQL. From Spark, jdbcDF.printSchema() also shows the proper schema, so I guess there is no issue with the connection.&lt;/P&gt;&lt;P&gt;I am using Spark 2.1.1 with Hive 1.2.1. My build.sbt file looks like this....&lt;/P&gt;&lt;PRE&gt;libraryDependencies ++= Seq(

"log4j"  %   "log4j"  % "1.2.17",
"org.apache.spark" % "spark-core_2.11"                        % "2.1.1" ,
"org.apache.spark" % "spark-streaming-kafka_2.10"             % "1.6.2",
"org.apache.spark" % "spark-hivecontext-compatibility_2.10"   % "2.0.0-preview",
"org.apache.spark" % "spark-sql_2.11"                         % "2.1.1" ,
"org.apache.spark" % "spark-hive_2.10"                        % "2.1.1",
"org.apache.hive"  % "hive-jdbc"                              % "1.2.1"

)&lt;/PRE&gt;&lt;P&gt;Can anyone suggest why I am not getting any output from show()? Thanks in advance...&lt;/P&gt;</description>
      <pubDate>Thu, 24 Aug 2017 15:17:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233221#M195049</guid>
      <dc:creator>Biswajit16</dc:creator>
      <dc:date>2017-08-24T15:17:41Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233222#M195050</link>
      <description>&lt;P&gt;Spark connects to the Hive metastore directly via a HiveContext. It does not (nor should it, in my opinion) use JDBC. &lt;/P&gt;&lt;P&gt;First, you must compile Spark with Hive support, then you need to explicitly call enableHiveSupport() on the SparkSession builder. &lt;/P&gt;&lt;P&gt;Additionally, Spark2 will need you to provide either: &lt;/P&gt;&lt;P&gt;1. A hive-site.xml file on the classpath&lt;/P&gt;&lt;P&gt;2. Setting hive.metastore.uris. Refer: &lt;A href="https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-sparksql" target="_blank"&gt;https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-sparksql&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Additional resources:&lt;/P&gt;&lt;P&gt;- &lt;A href="https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables" target="_blank"&gt;https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables&lt;/A&gt;&lt;/P&gt;&lt;P&gt;- &lt;A href="https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-hive-integration.html" target="_blank"&gt;https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-hive-integration.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 25 Aug 2017 01:29:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233222#M195050</guid>
      <dc:creator>JordanMoore</dc:creator>
      <dc:date>2017-08-25T01:29:47Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233223#M195051</link>
      <description>&lt;P&gt;Thanks Jordan... you saved my day... thanks a lot! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; &lt;/P&gt;</description>
      <pubDate>Fri, 25 Aug 2017 19:23:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233223#M195051</guid>
      <dc:creator>Biswajit16</dc:creator>
      <dc:date>2017-08-25T19:23:25Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233224#M195052</link>
      <description>&lt;P&gt;What about reading external Hive data via JDBC from Spark SQL?&lt;/P&gt;</description>
      <pubDate>Tue, 10 Oct 2017 12:30:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233224#M195052</guid>
      <dc:creator>casel_chen</dc:creator>
      <dc:date>2017-10-10T12:30:38Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233225#M195053</link>
      <description>&lt;P&gt;@CaselChen Again, Spark connects directly to the HiveMetastore - using JDBC requires you to go through HiveServer2&lt;/P&gt;</description>
      <pubDate>Wed, 11 Oct 2017 03:52:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233225#M195053</guid>
      <dc:creator>JordanMoore</dc:creator>
      <dc:date>2017-10-11T03:52:03Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233226#M195054</link>
      <description>&lt;P&gt;Hi Jordan, is there any way, by using JDBC through HiveServer2 with Spark, that we can load Hive data for non-transactional as well as transactional tables? I have tried it, but my tables were empty after loading. All the table structures returned (column names) are fine, though. Thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2018 00:25:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233226#M195054</guid>
      <dc:creator>thanhtu3009</dc:creator>
      <dc:date>2018-01-16T00:25:30Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233227#M195055</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/57151/thanhtu3009.html" nodeid="57151"&gt;@Tu Nguyen&lt;/A&gt;&lt;P&gt; - I'm afraid I don't understand your question. Spark does not use JDBC to communicate with Hive, but it can load Hive with any type of data that's able to be represented as a Spark DataSet. &lt;/P&gt;&lt;P&gt;You may want to try a "MSCK REPAIR TABLE &amp;lt;tablename&amp;gt;;" in Hive, though&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2018 00:54:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233227#M195055</guid>
      <dc:creator>JordanMoore</dc:creator>
      <dc:date>2018-01-16T00:54:35Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233228#M195056</link>
      <description>&lt;P&gt;Sorry Jordan, I was not clear. As the Spark context does not support Hive transactional tables, I am trying to use a SparkSession to load a Hive transactional table through JDBC, as below: &lt;/P&gt;&lt;P&gt;spSession.read.format("jdbc").option("url", url).option("driver", "org.apache.hive.jdbc.HiveDriver").option("dbtable", "tnguy.table_transactional_test").load().show()&lt;/P&gt;&lt;P&gt;The result is an empty table, but I was expecting 3 rows of data. &lt;/P&gt;&lt;P&gt;I also tried it on a non-transactional table, but it gave the same result. I am wondering whether we can use the Hive JDBC driver to load Hive data at all? &lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2018 02:06:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233228#M195056</guid>
      <dc:creator>thanhtu3009</dc:creator>
      <dc:date>2018-01-16T02:06:09Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233229#M195057</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/57151/thanhtu3009.html" nodeid="57151"&gt;@Tu Nguyen&lt;/A&gt; Where are you reading that you need to use JDBC from Spark to communicate with Hive? It isn't in the SparkSQL documentation. &lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables" target="_blank"&gt;https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables&lt;/A&gt;&lt;/P&gt;&lt;P&gt;1. Try using an alternative JDBC client, see if you get similar results. &lt;/P&gt;&lt;P&gt;2. What happens when you simply use the following?&lt;/P&gt;&lt;PRE&gt;val spark = SparkSession
  .builder()
  .appName("Spark Transactional Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

spark.table("tnguy.table_transactional_test").count()&lt;/PRE&gt;</description>
      <pubDate>Tue, 16 Jan 2018 10:29:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233229#M195057</guid>
      <dc:creator>JordanMoore</dc:creator>
      <dc:date>2018-01-16T10:29:19Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233230#M195058</link>
      <description>&lt;P&gt;I guess Tu Nguyen wants to load an external Hive table into Spark, right? If so, consider the following code: &lt;/P&gt;&lt;PRE&gt;object SparkHiveJdbc extends App {
  val spark = SparkSession.builder.master("local[2]").appName("SparkHiveJob").getOrCreate
  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  val driverName = "org.apache.hive.jdbc.HiveDriver"
  Class.forName(driverName)
  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://localhost:10000/default")
    .option("dbtable", "clicks_json")
    .load()
  df.printSchema()
  println(df.count())
  df.show()
}&lt;/PRE&gt;&lt;P&gt;I ran the above code and got the error reported below: &lt;/P&gt;&lt;PRE&gt;root&lt;BR /&gt;|-- clicks_json.ip: string (nullable = true) 
|-- clicks_json.timestamp: long (nullable = true) 
|-- clicks_json.url: string (nullable = true) 
|-- clicks_json.referrer: string (nullable = true) 
|-- clicks_json.useragent: string (nullable = true) 
|-- clicks_json.sessionid: integer (nullable = true) 
|-- clicks_json.tenantid: string (nullable = true) 
|-- clicks_json.datestamp: string (nullable = true) 
&lt;BR /&gt;998&lt;BR /&gt;
Caused by: java.lang.NumberFormatException: For input string: "clicks_json.timestamp" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at org.apache.hive.jdbc.HiveBaseResultSet.getLong(HiveBaseResultSet.java:368) ... 23 more&lt;/PRE&gt;&lt;P&gt;I think the reason for the error is that Spark loads the header row as the first data row when converting the ResultSet into internal Row objects. &lt;BR /&gt;Is anything wrong here?&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2018 12:12:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233230#M195058</guid>
      <dc:creator>casel_chen</dc:creator>
      <dc:date>2018-01-16T12:12:30Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233231#M195059</link>
      <description>&lt;P&gt;I have tried it, and it returned an empty table with all the correct column names.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2018 03:55:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233231#M195059</guid>
      <dc:creator>thanhtu3009</dc:creator>
      <dc:date>2018-01-17T03:55:14Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233232#M195060</link>
      <description>&lt;P&gt;It won't work on a transactional table.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2018 03:56:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233232#M195060</guid>
      <dc:creator>thanhtu3009</dc:creator>
      <dc:date>2018-01-17T03:56:55Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233233#M195061</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/29170/moorej.html" nodeid="29170"&gt;@Jordan Moore&lt;/A&gt; &lt;/P&gt;&lt;P&gt;SparkSession only work on non-transactional tables. With transactional table, it gave me error below :&lt;/P&gt;&lt;P&gt;scala&amp;gt; spark.table("user_tnguy11.table_orc_transactional_test").count()
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#33L])
   +- HiveTableScan HiveTableRelation `user_tnguy11`.`table_orc_transactional_test`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [year#17, month#18, s#19]
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
  at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:252)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
  at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
  at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
  at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2431)
  at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2430)
  at org.apache.spark.sql.Dataset$anonfun$55.apply(Dataset.scala:2838)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837)
  at org.apache.spark.sql.Dataset.count(Dataset.scala:2430)
  ... 50 elided
Caused by: java.lang.RuntimeException: serious problem
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.ShuffleDependency.&amp;lt;init&amp;gt;(Dependency.scala:91)
  at org.apache.spark.sql.execution.exchange.ShuffleExchange$.prepareShuffleDependency(ShuffleExchange.scala:264)
  at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:87)
  at org.apache.spark.sql.execution.exchange.ShuffleExchange$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
  at org.apache.spark.sql.execution.exchange.ShuffleExchange$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
  ... 74 more
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0248155_0000"
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:192)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
  ... 111 more
Caused by: java.lang.NumberFormatException: For input string: "0248155_0000"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at java.lang.Long.parseLong(Long.java:631)
  at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:310)
  at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:379)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2018 05:30:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233233#M195061</guid>
      <dc:creator>thanhtu3009</dc:creator>
      <dc:date>2018-01-17T05:30:15Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233234#M195062</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/57151/thanhtu3009.html" nodeid="57151"&gt;@Tu Nguyen&lt;/A&gt; I suggest you post a new question, rather than hijack this one. &lt;/P&gt;&lt;P&gt;Your error does not relate directly to transactional tables, but rather the OrcSplits generated by your table. &lt;/P&gt;&lt;P&gt;How about if you should try to use &lt;STRONG&gt;spark.read.format("orc") &lt;/STRONG&gt;from the filesystem?&lt;/P&gt;&lt;PRE&gt;org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998) ... 111 more 
Caused by: java.lang.NumberFormatException: For input string: "0248155_0000" 
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)&lt;/PRE&gt;</description>
      <pubDate>Wed, 17 Jan 2018 11:19:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233234#M195062</guid>
      <dc:creator>JordanMoore</dc:creator>
      <dc:date>2018-01-17T11:19:27Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233235#M195063</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/29170/moorej.html" nodeid="29170"&gt;@Jordan Moore&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Hi Jordan ,&lt;/P&gt;&lt;P&gt;Have to develop Stand alone  spark submit in Yarn mode using Kerbores cluster .I need to run Hive SQL using spark session.&lt;/P&gt;&lt;P&gt;Could you please help me on this.&lt;/P&gt;&lt;P&gt;Thanks in Advance!!!&lt;/P&gt;&lt;P&gt;Regrads,&lt;/P&gt;&lt;P&gt;Bipin&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jan 2018 11:53:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233235#M195063</guid>
      <dc:creator>bipin_pradhan</dc:creator>
      <dc:date>2018-01-23T11:53:50Z</dc:date>
    </item>
    <item>
      <title>Re: Spark with HIVE JDBC connection</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233236#M195064</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/63184/bipinpradhan.html" nodeid="63184"&gt;@Bipin Pradhan&lt;/A&gt;, please post your question as a brand new post. &lt;/P&gt;</description>
      <pubDate>Wed, 24 Jan 2018 02:47:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-with-HIVE-JDBC-connection/m-p/233236#M195064</guid>
      <dc:creator>JordanMoore</dc:creator>
      <dc:date>2018-01-24T02:47:25Z</dc:date>
    </item>
  </channel>
</rss>

