Created on 07-13-2017 03:40 AM - edited 09-16-2022 04:55 AM
Hello,
While running a pilot of CDSW & CDH, I'm struggling to run queries against Hive tables.
Scala Workbench
Test code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.Column

val sparkSession = SparkSession.builder
  .master("local")
  .appName("mapexample")
  .enableHiveSupport()
  .getOrCreate()
sparkSession.catalog.listTables.show()

val sqlContext = new HiveContext(sc)
sqlContext.sql("describe database default").show
sqlContext.sql("describe formatted default.mytable").show
sc.version
Test output:
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.Column

val sparkSession = SparkSession.builder.master("local").appName("mapexample").enableHiveSupport().getOrCreate()
sparkSession.catalog.listTables.show()
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+

val sqlContext = new HiveContext(sc)
sqlContext.sql("describe database default").show
+-------------------------+--------------------------+
|database_description_item|database_description_value|
+-------------------------+--------------------------+
|            Database Name|                   default|
|              Description|          default database|
|                 Location|      /user/hive/warehouse|
+-------------------------+--------------------------+

sqlContext.sql("describe formatted default.mytable").show
Name: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
Message: Table or view 'mytable' not found in database 'default';
StackTrace: at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:138)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:289)
  at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:437)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
Interesting to note: accessing Hive through SparkSession returns nothing, while HiveContext does return a description of the default DB, but one that does not match the actual default DB.
From Hue
default Default Hive database hdfs://<masked>/user/hive/warehouse public ROLE
The CDSW host has the right gateway roles installed, and hive-site.xml is present alongside the Spark config files.
The Hive Metastore log does not register any access when the workbench tries to reach Hive.
No Kerberos is involved.
I've run out of options to check, hence this post.
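For reference, this is the kind of quick check I can run from the workbench to see which catalog the session actually picked up (a minimal sketch using the workbench-provided spark and sc objects; the properties are standard Spark 2 / Hive ones, nothing CDSW-specific):

// "hive" means Hive support is wired in; "in-memory" means the session
// never connected to the Hive catalog.
println(spark.conf.get("spark.sql.catalogImplementation"))

// The metastore URI the session sees; a null/empty value suggests Spark
// fell back to a local embedded metastore instead of the cluster's one.
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))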
Thanks
Chris
Created 07-13-2017 04:06 AM
Hello Chris,
What are you trying to achieve? Why are you running in local mode? Can you use the expected yarn-client mode instead?
From the documentation:
"Unlike PySpark or Sparklyr, you can access a SparkContext assigned to the spark (SparkSession) and sc(SparkContext) objects on console startup, just as when using the Spark shell."
Can you access your Hive tables using the spark SparkSession object?
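For example, something along these lines (a sketch only; the "yarn" master and the app name are illustrative, and on CDSW the master normally comes from the deployed spark-defaults.conf rather than from code):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a session that runs on YARN and is wired to the Hive catalog.
val spark = SparkSession.builder
  .master("yarn")        // yarn client mode instead of local
  .appName("mapexample")
  .enableHiveSupport()   // connect the session to the Hive metastore
  .getOrCreate()

spark.catalog.listTables("default").show()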
Regards,
Peter
Created 07-13-2017 04:33 AM
Hi Peter,
The goal is to run data analysis using Spark, where part of the data is stored in Hive. The idea is to use the whole cluster to distribute the workload.
I missed the fact that I was running in local mode (as said, it's a pilot and I'm totally new to the Cloudera stack). I assumed the workbench ran in yarn mode by default. I'll dig into the docs again.
Accessing Hive using the SparkSession object in my test returned nothing. I'll try using yarn and report back.
Thanks!
Regards.
Chris
Created 07-13-2017 04:54 AM
Hi Peter,
I updated my test code as follows:
import org.apache.spark.sql.hive.HiveContext

spark.catalog.listTables.show()

val sqlContext = new HiveContext(sc)
sqlContext.sql("describe database default").show
sqlContext.sql("describe formatted default.mytable").show
sc.version
The result for SparkSession:
spark.catalog.listTables.show()
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+
(There should be one table listed)
The result for HiveContext:
sqlContext.sql("describe database default").show +-------------------------+--------------------------+ |database_description_item|database_description_value| +-------------------------+--------------------------+ | Database Name| default| | Description| default database| | Location| /user/hive/warehouse| +-------------------------+--------------------------+ sqlContext.sql("describe formatted default.mytable").show Name: org.apache.spark.sql.catalyst.analysis.NoSuchTableException Message: Table or view 'mytable' not found in database 'default'; StackTrace: at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:138) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:289) at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:437) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
While I do get a result for describe database, the description returned does not match the actual settings in Hive.
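One plausible reading of the mismatch (an assumption on my side): the session never reaches the cluster metastore, so Spark falls back to a local embedded one, and the reported location is a local warehouse path rather than the HDFS one. A quick check from the same session, using a standard Spark 2 property:

// A file:/ path here, rather than hdfs://.../user/hive/warehouse,
// indicates the session is using a local warehouse / embedded metastore.
println(spark.conf.get("spark.sql.warehouse.dir"))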
Created 07-13-2017 05:19 AM
Hi Chris,
You can use the SparkSession object to query your Hive tables. You don't need to create a HiveContext.
spark.catalog.listTables.show()
spark.sql("describe database default").show
spark.sql("describe formatted default.mytable").show
I don't think that this change would fix your issue.
This doesn't seem like a problem with your CDSW installation. Could you execute the same commands from a session terminal, or directly on the CDSW host machine using the spark2-shell CLI tool, to verify this?
Could you double check that a Hive gateway is deployed to your CDSW host?
Can you show that the table exists from, e.g., Beeline?
Regards,
Peter
Created 07-13-2017 07:14 AM
Hi Peter,
See the roles installed (the Hive GW is the second role listed).
From Spark2-shell (Workbench host)
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.catalog.listTables.show()
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+
From HUE running in the CDH cluster (table datascience): the table exists and can be queried from HUE; the returned data is correct.
Spark2-shell from a CDH node
scala> spark.catalog.listTables.show()
// Detected repl transcript. Paste more, or ctrl-D to finish.
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+

scala> spark.sql("describe database default").show
+-------------------------+--------------------------+
|database_description_item|database_description_value|
+-------------------------+--------------------------+
|            Database Name|                   default|
|              Description|          default database|
|                 Location|      file:/log/clouder...|
+-------------------------+--------------------------+

scala> spark.sql("describe formatted default.datascience").show
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'datascience' not found in database 'default';
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:138)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:289)
  at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:437)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  ... 48 elided
Thanks!
Created 07-20-2017 07:04 AM
Hello,
I fixed it.
In the Spark2 configuration screen (in Cloudera Manager for the CDH cluster), the Hive Service was set to none.
I set it to Hive and CDSW is now working as expected.
Thanks!
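For anyone hitting the same symptom: after setting the Hive Service dependency on the Spark2 service and redeploying the client configuration, a fresh CDSW Scala session can be sanity-checked roughly like this (a sketch using the workbench-provided spark object):

// Should now print "hive" instead of "in-memory"
println(spark.conf.get("spark.sql.catalogImplementation"))

// The default database should resolve to the HDFS warehouse,
// and the Hive tables should be listed again.
spark.sql("describe database default").show()
spark.catalog.listTables("default").show()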