
CDSW - Can't access Hive tables from Scala Workbench

Expert Contributor

Hello,

 

While running a pilot for CDSW & CDH, I'm struggling to run queries on Hive tables.

 

Scala Workbench 

Test code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.Column
import org.apache.spark.sql.SparkSession // needed for the builder below

val sparkSession = SparkSession.builder.master("local").appName("mapexample").
        enableHiveSupport().getOrCreate()
  
sparkSession.catalog.listTables.show()

val sqlContext = new HiveContext(sc)

sqlContext.sql("describe database default").show
sqlContext.sql("describe formatted default.mytable").show
sc.version

Test output:

import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.Column
val sparkSession = SparkSession.builder.master("local").appName("mapexample").
        enableHiveSupport().getOrCreate()
sparkSession.catalog.listTables.show()
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+

val sqlContext = new HiveContext(sc)
sqlContext.sql("describe database default").show
+-------------------------+--------------------------+
|database_description_item|database_description_value|
+-------------------------+--------------------------+
|            Database Name|                   default|
|              Description|          default database|
|                 Location|      /user/hive/warehouse|
+-------------------------+--------------------------+

sqlContext.sql("describe formatted default.mytable").show
Name: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
Message: Table or view 'mytable' not found in database 'default';
StackTrace:   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:138)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:289)
  at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:437)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)

 

Interesting to note: accessing Hive through the SparkSession returns nothing, while the HiveContext does return a description of the default DB, but one that does not match the actual default DB (compare with the Hue output below).
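A quick way to check which catalog a session is actually using (a hedged sketch; these are standard Spark 2.x settings, and spark below stands for the session object in use):

// "hive" means Hive support is wired in; "in-memory" (or no value) means Spark
// fell back to its built-in catalog and never contacts the cluster metastore.
println(spark.sparkContext.getConf.getOption("spark.sql.catalogImplementation"))

// The warehouse location the session believes in. A local file:/ path instead
// of the hdfs:// path shown in Hue points to an embedded, local metastore.
println(spark.conf.get("spark.sql.warehouse.dir"))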

 

From Hue (describe database default):

db_name:    default
comment:    Default Hive database
location:   hdfs://<masked>/user/hive/warehouse
owner_name: public
owner_type: ROLE

The CDSW host has the right gateway roles installed, and hive-site.xml is present alongside the Spark config files.

The Hive Metastore log does not register any access while the workbench is trying to reach Hive.

No Kerberos is involved.
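Along the same lines, the metastore URI the session sees can be printed directly (sketch; hive.metastore.uris is the standard Hive client property, so an empty value would explain why the metastore log stays silent):

// Should print the cluster's metastore URI when hive-site.xml is picked up;
// null or empty means the session is using a local, embedded metastore.
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))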

 

I have run out of options to check, hence this post.

 

Thanks

Chris

6 REPLIES

Super Collaborator

Hello Chris,

 

What are you trying to achieve? Why are you trying to run in local mode? Can you use the expected YARN client mode instead?

 

From the documentation:

"Unlike PySpark or Sparklyr, you can access a SparkContext assigned to the spark (SparkSession) and sc(SparkContext) objects on console startup, just as when using the Spark shell."

https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_spark_and_scala.htm...

 

Can you access your Hive tables using the spark SparkSession object?
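For example (a minimal sketch, using the spark and sc objects the workbench injects at startup):

// spark (SparkSession) and sc (SparkContext) are pre-created on console startup.
sc.master                                   // should report yarn, not local, on a gateway host
spark.catalog.listTables("default").show()  // should list the Hive tables in default
spark.sql("show tables in default").show()  // equivalent SQL form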

 

Regards,

Peter

Expert Contributor

Hi Peter,

 

The goal is to run data analysis using Spark where part of the data is stored in Hive. The idea is to use the whole cluster to distribute the workload.

 

I missed the fact that I was running in local mode (as said, it's a pilot, and I'm totally new to the Cloudera stack). I assumed the workbench ran in YARN mode by default. I'll dig into the docs again.

 

Accessing Hive through the SparkSession object in my test returned nothing. I'll try using YARN and report back; a sketch of what I intend to try is below.
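(A sketch; if the gateway configuration is right, spark.master=yarn should already come from spark-defaults.conf, making the explicit .master call redundant.)

import org.apache.spark.sql.SparkSession

// Same builder as before, but targeting YARN instead of local mode.
val sparkSession = SparkSession.builder
  .master("yarn")  // was .master("local")
  .appName("mapexample")
  .enableHiveSupport()
  .getOrCreate()

sparkSession.catalog.listTables.show()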

 

Thanks!

Regards.

Chris

 

 

Expert Contributor

Hi Peter,

 

I updated my test code as follows:

import org.apache.spark.sql.hive.HiveContext

// spark is the SparkSession the workbench provides on startup
spark.catalog.listTables.show()

// legacy HiveContext, kept for comparison with the SparkSession results
val sqlContext = new HiveContext(sc)

sqlContext.sql("describe database default").show
sqlContext.sql("describe formatted default.mytable").show
sc.version

The results for SparkSession:

spark.catalog.listTables.show()
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+

(There should be one table listed)
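For completeness, listing the databases the session can see might help narrow things down (a small sketch; listDatabases is a standard Catalog call):

// If only a default database with a file:/ locationUri shows up, the session
// is looking at a local catalog, not the cluster metastore.
spark.catalog.listDatabases.show(false)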

 

 

The results for HiveContext:

 

sqlContext.sql("describe database default").show
+-------------------------+--------------------------+
|database_description_item|database_description_value|
+-------------------------+--------------------------+
|            Database Name|                   default|
|              Description|          default database|
|                 Location|      /user/hive/warehouse|
+-------------------------+--------------------------+

sqlContext.sql("describe formatted default.mytable").show
Name: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
Message: Table or view 'mytable' not found in database 'default';
StackTrace:   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:138)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:289)
  at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:437)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)

While I do get a result for describe database, the description returned does not match the actual settings in Hive (the location lacks the hdfs://<masked> prefix shown in Hue).

 

Super Collaborator

Hi Chris,

 

You can use the SparkSession object to query your Hive tables. You don't need to create a HiveContext.

 

spark.catalog.listTables.show()

spark.sql("describe database default").show
spark.sql("describe formatted default.mytable").show

 

I don't think that this change would fix your issue.

 

This doesn't seem like a problem with your CDSW installation. Could you execute the same commands from a session terminal, or directly on the CDSW host machine using the spark2-shell CLI tool, to verify this?

 

Could you double-check that a Hive gateway is deployed to your CDSW host?

Can you show that the table exists from, e.g., Beeline?

 

Regards,

Peter

 

Expert Contributor

Hi Peter,

 

See the roles installed in the screenshot below (for reference, the Hive Gateway is the second role listed).

[Screenshot Roles.PNG: role instances on the CDSW host, including the Hive Gateway]

 

From spark2-shell (workbench host):

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.catalog.listTables.show()
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+

From Hue running on the CDH cluster (table datascience): the table exists and can be queried from Hue, and the returned data is correct. See the screenshot below.

[Screenshot metastore.PNG: the default.datascience table shown in the Hue metastore browser]

spark2-shell from a CDH node:

 

scala> spark.catalog.listTables.show()

// Detected repl transcript. Paste more, or ctrl-D to finish.

+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+

scala> spark.sql("describe database default").show
+-------------------------+--------------------------+
|database_description_item|database_description_value|
+-------------------------+--------------------------+
|            Database Name|                   default|
|              Description|          default database|
|                 Location|      file:/log/clouder...|
+-------------------------+--------------------------+

scala> spark.sql("describe formatted default.datascience").show
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'datascience' not found in database 'default';
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:138)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:289)
  at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:437)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  ... 48 elided


Note the Location value in the describe database output: file:/... is a local filesystem path, not the hdfs://<masked>/user/hive/warehouse location Hue reports, so this shell also seems to be talking to a local catalog rather than the cluster metastore.

Thanks!

ACCEPTED SOLUTION

Expert Contributor

Hello,

 

I fixed it.

 

In the Spark 2 configuration screen (in Cloudera Manager for the CDH cluster), the Hive Service setting was set to none.

 

[Screenshot Capture.PNG: the Spark 2 configuration in Cloudera Manager, with the Hive Service dependency selected]

I set it to Hive and CDSW is now working as expected.
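After redeploying the client configuration and starting a fresh session, the earlier checks should confirm the fix (sketch, same standard settings as above):

// Expect Some("hive") now that the Spark 2 service depends on Hive.
println(spark.sparkContext.getConf.getOption("spark.sql.catalogImplementation"))

// The Hive tables (e.g. default.datascience) should now be listed.
spark.catalog.listTables("default").show()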

Thanks!