Integrating Spark with Tableau

Rising Star

I have been successful in integrating Tableau with the Spark Thrift Server using the Simba ODBC driver. I have tried using CACHE TABLE in the initial SQL, and the performance has been great so far. I am now looking for a way to cache and uncache a few of the frequently used tables when they are updated through our data pipelines.
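
For illustration, the caching in the initial SQL is essentially just a Spark SQL statement of this form (the table name here is only a placeholder):

    CACHE TABLE sales;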

The challenge that I am facing is that a table cached via Tableau will remain in cache for the lifetime of the Thrift Server, but when I write my data pipeline process and submit Spark jobs, they will use a different SparkContext.

Can anyone please suggest how I can connect to the Thrift Server's context from the backend process?

1. Is there a way to reuse the Thrift Server's context from spark-submit or spark-shell?

2. At the end of my data pipeline, would it be a good idea to invoke a simple shell script that connects to the Thrift service and refreshes the cache?

Note that both my backend and the BI tool are using the same cluster: I used the same YARN cluster when starting the Thrift service as well as when submitting the backend jobs.

Thanks,

Jayadeep

1 ACCEPTED SOLUTION

avatar
Super Collaborator

Spark Thrift Server (STS) runs its SparkContext within the STS JVM daemon. That SparkContext is not available to external clients (with or without spark-submit). The only way to access that SparkContext is via a JDBC connection to STS. After your external processing has completed, you could submit a cache refresh operation to your STS.
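
For example, a minimal sketch of such a refresh step, assuming the Thrift Server listens on sts-host:10015 and sales is one of the cached tables (hostname, port, and table name are placeholders):

    #!/bin/bash
    # Runs at the end of the data pipeline. Because beeline goes through the
    # STS JDBC endpoint, the refresh happens inside the same SparkContext
    # that serves Tableau.
    beeline -u "jdbc:hive2://sts-host:10015" \
            -e "UNCACHE TABLE sales; CACHE TABLE sales;"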


8 REPLIES

New Contributor

How did you get the ODBC driver to connect? I keep trying to connect to port 10015 or 10016, which are Spark and Spark2 respectively, and keep getting errors. I am trying to connect from the Windows 64-bit Spark ODBC driver.

    Driver Version: V1.0.9.1009
    Running connectivity tests...
    Attempting connection
    Failed to establish connection
    SQLSTATE: HY000 [Simba][ThriftExtension] (5) Error occurred while contacting server: No more data to read. This could be because you are trying to establish a non-SSL connection to a SSL-enabled server
    TESTS COMPLETED WITH ERROR.

Rising Star

In Tableau there is a Spark connector, which I used to connect to the Spark Thrift Server. Looking at your error message, it seems to me that your application is not able to reach the Spark Thrift Server instance. Can you telnet to the host and port to check connectivity?

Contributor

What about registering a temp table and then creating a static table to hold onto the results? Drop and recreate it as needed.
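
A rough sketch of the drop-and-recreate part, for example (the endpoint and table names are placeholders):

    #!/bin/bash
    # Rebuild a static snapshot table from the latest pipeline output,
    # then let the BI tool query the snapshot instead of the cache.
    beeline -u "jdbc:hive2://sts-host:10015" \
            -e "DROP TABLE IF EXISTS sales_snapshot; CREATE TABLE sales_snapshot AS SELECT * FROM sales;"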


Rising Star

Hi All - We have implemented the solution as explained by bikas. At the end of our batch processing we invoke a process that refreshes the cache inside Spark by connecting to the Thrift Server through beeline. The Tableau dashboard also connects to the same Thrift Server instance and is therefore able to get the updated data in-memory. Currently, we are looking at high availability of the cache in case one of the STS instances goes down.

Super Collaborator

Don't forget to uncache the old data 🙂 Also, each STS has its own SparkContext, which will be lost if that STS is lost. So there is currently no way to preserve the cache inside an STS if that STS goes down. Having two identical STS instances with identical caches is possibly the only solution, assuming your cache creation code is consistent.
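
If you do run two identical instances, the refresh step just has to hit both endpoints, e.g. (hostnames, ports, and table name are placeholders):

    #!/bin/bash
    # Refresh the cache on both STS instances so their caches stay identical.
    for sts in sts-host-1:10015 sts-host-2:10015; do
      beeline -u "jdbc:hive2://${sts}" -e "UNCACHE TABLE sales; CACHE TABLE sales;"
    done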

Rising Star

Thanks bikas 🙂 I am doing what you mentioned: I have two STS instances on two separate servers, and the caching is done on both instances. Quick question: when I run CACHE TABLE TABLE_NAME, will it first uncache and then cache the data?

Super Collaborator

No, I don't think Spark will uncache a different data set when a new one is cached. How are you going to load balance or fail over from one STS to another?