Created on 12-14-2015 03:25 PM - edited 09-16-2022 02:53 AM
Hi,
Is Spark Streaming (possibly together with Spark SQL) suited for interactive querying, specifically for generating reporting dashboards in Tableau?
We are building a data lake that holds all of our organization's data as Avro files. We need to create dashboards and reports in Tableau from the data in the data lake. The challenge is that some of these reports have to process millions of records under strict load-time limits (sometimes as strict as under 10 seconds per report). Right now we are forced to maintain an Oracle data mart, populated from the data lake, from which Tableau pulls data to generate reports.
We want to avoid creating a separate data mart, so we are looking at connecting Tableau directly to the Hadoop data lake. Plain Spark is ruled out because the reports need to be interactive, but I learned from yesterday's Hortonworks webinar on Spark that Spark Streaming can be used here. Is Spark Streaming (possibly together with Spark SQL) suited for this kind of interactive querying from Tableau?
Are there any similar use cases that you can point me to?
Thanks, Raga
Created 01-07-2016 04:10 PM
Actually, many BI vendors, including Tableau, have announced a Spark connector over JDBC, which should presumably be able to query data loaded into RDDs in memory. If you load data via Spark Streaming into an RDD and then either give it a schema and register it as a table (rdd.registerTempTable) or convert it to a DataFrame (rdd.toDF), you should be able to query that data over a JDBC connection and display it in a dashboard.
Here is info on the Tableau connector, including a video at the bottom of the page:
Created 12-14-2015 05:31 PM
Tableau, or any other BI tool for that matter, can't connect directly to Spark Streaming. Spark Streaming only processes the data; you still need to persist it in HDFS or somewhere else before Tableau or anything else can connect to it.
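To make the "process, then persist, then query" split concrete, here is a minimal sketch in plain Python. It does not use Spark at all: the `micro_batches` and `process` functions are hypothetical stand-ins for the streaming job, and an in-memory SQLite table (the name `sales_by_region` is made up) stands in for whatever store you actually persist to, such as HDFS-backed tables. The only point is that the dashboard query talks to the store, never to the stream.

```python
import sqlite3

def micro_batches(events, batch_size):
    """Yield fixed-size slices of the event list, mimicking micro-batches."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]

def process(batch):
    """Streaming-side work: sum the amounts per region within one batch."""
    totals = {}
    for region, amount in batch:
        totals[region] = totals.get(region, 0) + amount
    return list(totals.items())

# Assumption: SQLite stands in for the persistent, queryable store.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE sales_by_region (region TEXT, amount INTEGER)")

# Made-up sample events: (region, amount).
events = [("east", 10), ("west", 5), ("east", 7), ("west", 3), ("east", 1)]

# The "streaming job": process each micro-batch and persist its result.
for batch in micro_batches(events, batch_size=2):
    store.executemany("INSERT INTO sales_by_region VALUES (?, ?)", process(batch))
    store.commit()

def dashboard_query(conn):
    """What the BI tool would do: query the persisted table, not the stream."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales_by_region "
        "GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)

report = dashboard_query(store)
print(report)
```

In a real deployment the loop would be the Spark Streaming job and the query would come from Tableau over its connector; the shape of the hand-off is the same.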
In case you need to do interactive analysis with a very short SLA, you need a system which can index the data. Pure row scans won't cut it. One example would be to connect Spark Streaming to Solr. Solr will index the data as it is inserted. You can then build a read-only dashboard using Banana, or build a custom app which queries Solr for user-defined queries.
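As a rough illustration of why indexing at insert time helps, the toy sketch below compares a full scan with a lookup in a small inverted index. This is not how Solr is implemented (its Lucene-based index is far more sophisticated); the `InvertedIndex` class and the sample records are made up purely to show the idea that the work is paid once on insert rather than on every query.

```python
from collections import defaultdict

# Made-up sample records.
records = [
    {"id": 1, "text": "spark streaming into solr"},
    {"id": 2, "text": "tableau dashboard on oracle"},
    {"id": 3, "text": "solr indexes data on insert"},
]

def scan_search(records, term):
    """Full scan: every query inspects every record."""
    return [r["id"] for r in records if term in r["text"].split()]

class InvertedIndex:
    """Toy index mapping each token to the ids of records containing it."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, record):
        # Tokenize once, at insert time.
        for token in set(record["text"].split()):
            self._postings[token].append(record["id"])

    def search(self, term):
        # A query is a single dictionary lookup, independent of record count.
        return list(self._postings.get(term, []))

index = InvertedIndex()
for record in records:
    index.add(record)

scan_hits = scan_search(records, "solr")
index_hits = index.search("solr")
print(scan_hits, index_hits)
```

Both approaches return the same ids; the difference is that the scan cost grows with the number of records on each query, while the index lookup does not.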
So the flow is:
Streaming Data -> Spark Streaming -> Solr -> Banana Dashboard (or a custom app if interactivity is desired)
Look here for an example of streaming Tweets from Spark into Solr: