
Spark temporary table is not shown in beeline


New Contributor

Hi,

I'm using the latest HDP Sandbox (2.4.0.0-169). I have written the following code in spark-shell (Spark version 1.6.0):

var orcData = sqlContext.sql("select code from sample_07")
var paymentDataCache = orcData.cache
paymentDataCache.registerTempTable("paymentDataCache")

I followed the commands below to start the Thrift server and beeline:

1) export SPARK_HOME=/usr/hdp/2.4.0.0-169/spark/

2) sudo ./sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10015

3) ./bin/beeline

4) !connect jdbc:hive2://localhost:10015

Now if I execute show tables, I expect to see the paymentDataCache temporary table, but it is not listed. Please find the attached screenshot (sparkbeeline.png).

I also tried to start the Thrift server using

sudo ./sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10015 --conf spark.sql.hive.thriftServer.singleSession=true

but no luck.

We tried the same process on a 9-node HDP cluster (2.3.2.0-2950 with Spark 1.4.1), but we do not see the temporary tables in beeline there either.


Re: Spark temporary table is not shown in beeline

Hello Amit:

Spark operations are either transformations or actions. Transformations are lazily evaluated, which means they are not executed until an action is performed. From what I can see, you have not called any action in your code. Try adding an action after the .cache call and BEFORE the .registerTempTable call:

var orcData = sqlContext.sql("select code from sample_07"); 
var paymentDataCache = orcData.cache; 
paymentDataCache.count;  // add this action
paymentDataCache.registerTempTable("paymentDataCache");

This action should cause the whole pipeline to execute and allow the temp table to be created.

Re: Spark temporary table is not shown in beeline

New Contributor

Hi Paul,

Thanks for the response. I tried your suggestion, but no luck. I still do not see the temporary table (paymentDataCache).

Re: Spark temporary table is not shown in beeline

Okay, if you want to see tables through the Hive Thrift server, I believe you should use a HiveContext, not the plain sqlContext. A HiveContext can be created from sc, the SparkContext, like this:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

HiveContext interoperates with the Hive metastore, which should allow you to see the temp table. In addition, please test your temp table with a statement like this:

hiveContext.sql("SELECT code FROM paymentDataCache").show()

Re: Spark temporary table is not shown in beeline

Super Guru

You've got to run a select to check the data before Spark exits. This is a good thing to do within Zeppelin so you can step through it.

Re: Spark temporary table is not shown in beeline

The main reason tables don't show up in SparkSQL even though they exist in Hive is that SparkSQL isn't using the Hive metastore. SparkSQL uses an embedded Derby database as its metastore by default if there isn't a valid hive-site.xml in /etc/spark/conf.
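
A quick way to verify which metastore a spark-shell session is talking to (a minimal sketch, assuming Spark 1.x; hive.metastore.uris is the standard Hive setting, and the fallback string shown is just an illustrative placeholder):

// If hive-site.xml from /etc/spark/conf is being picked up, this prints the
// remote metastore URI; with the embedded Derby default it stays unset.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
println(hiveContext.getConf("hive.metastore.uris", "<not set: embedded Derby>"))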

Re: Spark temporary table is not shown in beeline

Guru

@Amit Kumar Agarwal

You need to start the Thrift server from the Spark driver that holds the HiveContext you are using to create the temp tables. Try this:

Start a spark-shell like this:

spark-shell --conf spark.sql.hive.thriftServer.singleSession=true

Inside the spark-shell (make sure nothing is already running on port 10002: netstat -nlp | grep 10002):

import org.apache.spark.sql.hive.thriftserver._
import org.apache.spark.sql.hive.HiveContext

// Create the HiveContext and register the temp table in it
val hiveContext = new HiveContext(sc)
var orcData = hiveContext.sql("select code from sample_07")
var paymentDataCache = orcData.cache
paymentDataCache.registerTempTable("paymentDataCache")

// Expose this same context through a Thrift server on port 10002
hiveContext.setConf("hive.server2.thrift.port", "10002")
HiveThriftServer2.startWithContext(hiveContext)

Then from a beeline shell:

!connect jdbc:hive2://localhost:10002
show tables;

You should be able to see the temp table for the exposed HiveContext.

+----------------------+--------------+--+
|      tableName       | isTemporary  |
+----------------------+--------------+--+
| paymentdatacache     | true         |
| dimcustomer          | false        |
| dimcustomertemp      | false        |
| dimproduct           | false        |
| dimproducttemp       | false        |
| factonlinesales      | false        |
| factonlinesalestemp  | false        |
| factsales            | false        |
| factsalestemp        | false        |
| sample_07            | false        |
| sample_08            | false        |
+----------------------+--------------+--+
11 rows selected (0.118 seconds)

Re: Spark temporary table is not shown in beeline

Expert Contributor
@Vadim

Thank you, this helps. But I would call it a workaround, if you would agree. Since I cannot start the Spark Thrift Server (STS) this way from Ambari, it is not easy for a developer who wants to share tables with others. And if I start the STS from Ambari, it will not work.

Is there a proper solution where I can start the STS from Ambari, end users can start the shell and register temp tables, and other users can share the same temp tables? Or where the same user can use the temp tables from beeline?

However, I will note this workaround for our users until you suggest a proper solution :-).

Many thanks.

Re: Spark temporary table is not shown in beeline

Guru

@Smart Solutions

You would need to register a Spark context with the Thrift server, and I don't believe there is currently any way to do that. Because a temp table is actually an RDD local to the context in which it was created, it cannot be accessed from a Thrift server that does not have a handle on that context. The generic Thrift server is able to see all of the Hive tables because it has a connection to the Hive metastore. The only other way to share RDDs across contexts that I am aware of is to use an in-memory file system. This is not an Ambari issue; it's a matter of RDDs being a context-local data structure. I don't believe the integration you are looking for currently exists.
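
To illustrate the context-local scoping (a minimal sketch from a spark-shell session, assuming Spark 1.x; the table name onlyInA is hypothetical):

import org.apache.spark.sql.SQLContext

// Two independent SQL contexts over the same SparkContext
val ctxA = new SQLContext(sc)
val ctxB = new SQLContext(sc)

// Register a temp table in ctxA only
ctxA.range(1).registerTempTable("onlyInA")

ctxA.tableNames().contains("onlyInA")  // true
ctxB.tableNames().contains("onlyInA")  // false: temp tables are per-context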