
Spark with HIVE JDBC connection

Contributor

Hi All,

I have a sample table (students1) in Hive that I want to query from Spark over JDBC (Hive is not in the same cluster). I was trying the following code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  //Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance()
  val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .getOrCreate()

  // Read the Hive table through the HiveServer2 JDBC endpoint
  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://34.223.237.55:10000")
    .option("dbtable", "students1")
    .option("user", "hduser")
    .option("password", "hadoop")
    //.option("driver", "org.apache.hadoop.hive.jdbc.HiveDriver")
    .load()

  println("able to connect------------------")

  jdbcDF.show()
  jdbcDF.printSchema()

  jdbcDF.createOrReplaceTempView("std")

  val sqlDF = spark.sql("select * from std")
  println("Start println-----")
  spark.sqlContext.sql("select * from std").collect().foreach(println)
  println("end println-----")
  sqlDF.show(false)
}

I have tried multiple approaches, but every time it shows only the table structure with the column names, like this:

+--------------+-------------+-------------+
|students1.name|students1.age|students1.gpa|
+--------------+-------------+-------------+
+--------------+-------------+-------------+

No data comes back, yet I can retrieve data with a SQL query from DBeaver on my local machine. From Spark, jdbcDF.printSchema() also shows the proper schema, so I don't think the connection is the issue.

I am using Spark 2.1.1 with Hive 1.2.1. My build.sbt file looks like this:

libraryDependencies ++= Seq(
  "log4j"            % "log4j"                                % "1.2.17",
  "org.apache.spark" % "spark-core_2.11"                      % "2.1.1",
  "org.apache.spark" % "spark-streaming-kafka_2.10"           % "1.6.2",
  "org.apache.spark" % "spark-hivecontext-compatibility_2.10" % "2.0.0-preview",
  "org.apache.spark" % "spark-sql_2.11"                       % "2.1.1",
  "org.apache.spark" % "spark-hive_2.10"                      % "2.1.1",
  "org.apache.hive"  % "hive-jdbc"                            % "1.2.1"
)

Can anyone suggest why I am not getting any output from show()? Thanks in advance.

1 ACCEPTED SOLUTION

Super Collaborator

Spark connects to the Hive metastore directly via a HiveContext. It does not (nor should, in my opinion) use JDBC.

First, you must compile Spark with Hive support, then you need to explicitly call enableHiveSupport() on the SparkSession builder.

Additionally, Spark 2 will need you to provide either:

1. a hive-site.xml file on the classpath, or

2. the hive.metastore.uris setting (see the sketch below). Refer: https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-...
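For illustration, here is a minimal sketch of that approach, assuming Spark was built with Hive support. The metastore host reuses the IP from the question, and port 9083 is only the common metastore default, so treat the whole URI as a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  // Placeholder metastore URI; in practice this usually comes from hive-site.xml
  .config("hive.metastore.uris", "thrift://34.223.237.55:9083")
  .enableHiveSupport()
  .getOrCreate()

// The metastore supplies the schema and file locations;
// no JDBC URL, user, or password is involved.
spark.sql("select * from students1").show()

Note that the data itself is then read from the underlying storage by Spark's executors, so the machine running Spark must be able to reach both the metastore and the warehouse filesystem.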

Additional resources

- https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

- https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-hive-integration.html


15 REPLIES

Explorer

I have tried it, and it returned an empty table with all the correct column names.

Explorer

It won't work on transactional tables.

Explorer

Hi @Jordan Moore

SparkSession only works on non-transactional tables. With a transactional table, it gave me the error below:

scala> spark.table("user_tnguy11.table_orc_transactional_test").count()
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#33L])
   +- HiveTableScan HiveTableRelation `user_tnguy11`.`table_orc_transactional_test`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [year#17, month#18, s#19]
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:252)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
    at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
    at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2431)
    at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2430)
    at org.apache.spark.sql.Dataset$anonfun$55.apply(Dataset.scala:2838)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837)
    at org.apache.spark.sql.Dataset.count(Dataset.scala:2430)
  ... 50 elided
Caused by: java.lang.RuntimeException: serious problem
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$.prepareShuffleDependency(ShuffleExchange.scala:264)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:87)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
  ... 74 more
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0248155_0000"
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
  ... 111 more
Caused by: java.lang.NumberFormatException: For input string: "0248155_0000"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:310)
    at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:379)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Super Collaborator

@Tu Nguyen I suggest you post a new question, rather than hijack this one.

Your error does not relate directly to transactional tables, but rather to the ORC splits being generated for your table.

How about trying spark.read.format("orc") to load the table's files directly from the filesystem? The key lines of your trace are:

org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998) ... 111 more 
Caused by: java.lang.NumberFormatException: For input string: "0248155_0000" 
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
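As a rough sketch of that workaround (the warehouse path below is hypothetical; substitute the actual LOCATION of your table):

// Hypothetical path; check the table's LOCATION in the metastore
val orcDF = spark.read
  .format("orc")
  .load("/apps/hive/warehouse/user_tnguy11.db/table_orc_transactional_test")

orcDF.count()

Keep in mind that a transactional table's directory holds base and delta files, so a raw ORC read may still stumble on the delta directories until the table has been compacted.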

New Contributor

@Jordan Moore

Hi Jordan,

I have to develop a standalone spark-submit job in YARN mode on a Kerberized cluster. I need to run Hive SQL using a SparkSession.

Could you please help me with this?

Thanks in advance!

Regards,

Bipin

Super Collaborator

@Bipin Pradhan, please post your question as a brand new post.