I'm using the latest HDP Sandbox (220.127.116.11-169). I have written the following code in spark-shell (Spark version 1.6.0):
var orcData = sqlContext.sql("select code from sample_07")
var paymentDataCache = orcData.cache
paymentDataCache.registerTempTable("paymentDataCache")
I followed the below commands to start the thrift server and beeline:
1) export SPARK_HOME=/usr/hdp/18.104.22.168-169/spark/
2) sudo ./sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10015
3) ./bin/beeline
4) !connect jdbc:hive2://localhost:10015
Now if I execute show tables, I expect to see the paymentDataCache temporary table, but I do not. Please find the attached screenshot.
I also tried to start the thrift server using
sudo ./sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10015 --conf spark.sql.hive.thriftServer.singleSession=true but no luck.
We tried the same process on a 9-node HDP cluster (22.214.171.124-2950 with Spark 1.4.1), but we do not see the temporary tables in Spark beeline there either.
Spark contains both Transformations and Actions. Transformations use "lazy execution", which means they are not executed until an action is performed. From what I see, you have not called any actions in your code. Try adding this statement after the .cache statement and BEFORE the .registerTempTable statement:
var orcData = sqlContext.sql("select code from sample_07")
var paymentDataCache = orcData.cache
paymentDataCache.count // add this action
paymentDataCache.registerTempTable("paymentDataCache")
This action should cause the whole pipeline to execute and allow the temp table to be created.
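The laziness itself is easy to see without Spark at all; as a minimal sketch, a plain Scala lazy view behaves like an un-actioned RDD (nothing in the map runs until something forces evaluation, the way count forces an RDD):

```scala
// Plain-Scala analogy for Spark's lazy transformations (no cluster required).
var evaluated = false

// Like an RDD transformation: building the view does not run the map body.
val lazyView = (1 to 5).view.map { x => evaluated = true; x * 2 }
assert(!evaluated) // nothing has executed yet

// Like an RDD action (count/collect): forcing the view runs the pipeline.
val total = lazyView.sum
assert(evaluated)
assert(total == 30) // 2 + 4 + 6 + 8 + 10
```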
Thanks for the response. I tried your suggestion but no luck. I still do not see temporary table (paymentDataCache).
Okay, if you want to see tables from the Hive thrift server, I believe you should use a HiveContext, not sqlContext. A HiveContext can be created from sc, the SparkContext, like this:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext will interoperate with the Hive metastore, which should allow you to see the temp table. In addition, please test your temp table with a statement like this:
hiveContext.sql("SELECT code FROM paymentDataCache").show()
That runs a select and checks the data before Spark exits. This is good to do within Zeppelin so you can step through it.
The main reason tables that exist in Hive don't show up in SparkSQL is that SparkSQL isn't using the Hive metastore. SparkSQL uses Derby as its metastore by default if there isn't a valid hive-site.xml in /etc/spark/conf.
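You can check which metastore your session actually sees from inside spark-shell; a hedged sketch (this assumes an HDP spark-shell session with sc already defined, so it is not runnable stand-alone):

```scala
// Inside spark-shell: inspect the metastore configuration the HiveContext picked up.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// An empty value here means no hive.metastore.uris was found, i.e. Spark
// fell back to a local Derby metastore instead of the shared Hive one.
println(hiveContext.getConf("hive.metastore.uris", ""))
```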
Start a spark-shell like this:
spark-shell --conf spark.sql.hive.thriftServer.singleSession=true
Inside the spark-shell: (Make sure nothing is running on port 10002 [netstat -nlp|grep 10002])
import org.apache.spark.sql.hive.thriftserver._
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
var orcData = hiveContext.sql("select code from sample_07")
var paymentDataCache = orcData.cache
paymentDataCache.registerTempTable("paymentDataCache")
hiveContext.setConf("hive.server2.thrift.port", "10002")
HiveThriftServer2.startWithContext(hiveContext)
Then from a beeline shell:
!connect jdbc:hive2://localhost:10002
show tables;
You should be able to see the temp table for the exposed HiveContext.
+----------------------+--------------+--+
|      tableName       | isTemporary  |
+----------------------+--------------+--+
| paymentdatacache     | true         |
| dimcustomer          | false        |
| dimcustomertemp      | false        |
| dimproduct           | false        |
| dimproducttemp       | false        |
| factonlinesales      | false        |
| factonlinesalestemp  | false        |
| factsales            | false        |
| factsalestemp        | false        |
| sample_07            | false        |
| sample_08            | false        |
+----------------------+--------------+--+
11 rows selected (0.118 seconds)
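The same temp table can also be reached programmatically over JDBC once the embedded Thrift server is up. A hedged sketch (it assumes the server started above is listening on port 10002 and that the Hive JDBC driver is on the classpath, so it is not runnable stand-alone):

```scala
// Query the context-local temp table through the embedded Thrift server via JDBC.
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10002", "", "")
try {
  val rs = conn.createStatement()
    .executeQuery("SELECT code FROM paymentDataCache LIMIT 5")
  while (rs.next()) {
    println(rs.getString(1))
  }
} finally {
  conn.close()
}
```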
Thank you, this helps. But I would say it's a workaround, if you would agree. Since I cannot start the STS this way from Ambari, it is not easy for a developer who wants to share tables with others. If I start the STS from Ambari, it will not work.
Is there a proper solution where I can start the STS from Ambari, end users can start the shell and register temp tables, and other users can share the same temp tables? Or where the same user can use the temp tables from beeline?
However, I will note this workaround for our users until you suggest a proper solution :-).
You would need to register a Spark context with a Thrift Server, and I don't believe that there is currently any way to do that. Because a temp table is actually an RDD local to the context in which it was created, it cannot be accessed from a Thrift Server that does not have a handle on that context. The generic Thrift Server is able to see all of the Hive tables because it has a connection to the Hive metastore. The only other way to share RDDs across contexts that I am aware of is to use an in-memory file system. This is not an Ambari issue; it's a matter of RDDs being a context-local data structure. I don't believe the integration you are looking for currently exists.