Created on 03-06-2017 04:27 AM - edited 09-16-2022 04:11 AM
Hi,
I have found a general template for accessing Spark temporary data (i.e., a DataFrame) via an external tool using JDBC. From what I have found, it should be quite simple:
1. Run spark-shell or submit spark job
2. Configure a HiveContext and then start HiveThriftServer2 from the job.
3. In a separate session, access the Thrift server via beeline and query the data.
Here is my code (Spark 2.1):
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// build a HiveContext and configure the embedded Thrift server
val sql = new HiveContext(sc)
sql.setConf("hive.server2.thrift.port", "10002")
sql.setConf("hive.server2.authentication", "KERBEROS")
sql.setConf("hive.server2.authentication.kerberos.principal", "hive/host1.lab.hadoop.net@LAB.HADOOP.NET")
sql.setConf("hive.server2.authentication.kerberos.keytab", "/home/h.keytab")
sql.setConf("spark.sql.hive.thriftServer.singleSession", "true")

// register a temporary view and verify it is visible in this session
val data = sql.sql("select 112 as id")
data.collect
data.createOrReplaceTempView("yyy")
sql.sql("show tables").show

// start the Thrift server against this context
HiveThriftServer2.startWithContext(sql)
WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException

Connect to the JDBC server:
beeline -u "jdbc:hive2://localhost:10002/default;principal=hive/host1.lab.hadoop.net@LAB.HADOOP.NET"
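Once connected, I query the view; roughly, the beeline exchange looks like this (prompt and output shown for illustration):

0: jdbc:hive2://localhost:10002/default> select * from yyy;
Error: org.apache.spark.sql.AnalysisException: Table or view not found: yyy; line 1 pos 14 (state=,code=0)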
However, when I launch HiveThriftServer2 I can connect to the Spark Thrift server, but I do not see the temporary table: "show tables" in beeline does not list it, and trying to query "yyy" throws an error. From the spark-shell side the view is clearly registered:
scala> sql.sql("show tables").collect
res11: Array[org.apache.spark.sql.Row] = Array([,sometablename,true], [,yyy,true])
Yet when the query arrives from beeline, the shell (where the Thrift server runs) logs:
scala> 17/03/06 11:15:50 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query, currentState RUNNING,
org.apache.spark.sql.AnalysisException: Table or view not found: yyy; line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:459)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:478)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
If I create a table from beeline via "create table t as select 100 as id", the table is created and I can see it in spark-shell (the data is stored locally in the spark-warehouse directory), so the other direction works.
So the question is: what am I missing? Why can't I see the temporary table?
Thanks
Created 03-08-2017 03:52 AM
I have found out what the problem was. The solution is to set the singleSession property to true on the command line; setting it programmatically does not work, presumably because the property is read when the session is created, so setConf runs too late.
/bin/spark-shell --conf spark.sql.hive.thriftServer.singleSession=true
WORKS.
/bin/spark-shell
...
sql.setConf("spark.sql.hive.thriftServer.singleSession","true")
...
DOES NOT WORK
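For completeness, here is a minimal sketch of the working flow (the port and principal are the ones from my original post; the Kerberos setConf calls are omitted for brevity and should be set as above before starting the server):

/bin/spark-shell --conf spark.sql.hive.thriftServer.singleSession=true

// inside the shell (Spark 2.x): register the view, then start the Thrift server
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

spark.sqlContext.setConf("hive.server2.thrift.port", "10002")
spark.sql("select 112 as id").createOrReplaceTempView("yyy")
HiveThriftServer2.startWithContext(spark.sqlContext)

// from a second terminal:
// beeline -u "jdbc:hive2://localhost:10002/default;principal=hive/host1.lab.hadoop.net@LAB.HADOOP.NET"
// select * from yyy;   -- now resolves, since beeline shares the single session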
Created 04-13-2018 04:04 PM
I am using Spark 2.0.2. Can you help me with the build.sbt file?
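A minimal build.sbt for this kind of job might look like the following; this is a sketch assuming Scala 2.11 and that the jar is launched with spark-submit (hence "provided"), and the project name is hypothetical. Note that the Thrift server classes live in a separate artifact, spark-hive-thriftserver, which spark-hive does not pull in:

// build.sbt (sketch)
name := "spark-thrift-example"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-hive" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-hive-thriftserver" % "2.0.2" % "provided"
)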