When it comes to BI integration (e.g. consuming from Cognos/Tableau/Pentaho/SpagoBI), it is quite straightforward to see the similarity between Hive and an RDBMS. As in the old SQL-over-relational-DB days, the reporting engine just issues a query through JDBC/ODBC, and voilà. No question there.
But... what would the equivalent flow look like with Spark / Spark SQL? How does it fit into a BI engine?
For example, suppose you have a data store (any Hadoop flavour, such as HDFS flat files, Hive or HBase) and a Spark process that grabs the data, creates RDDs from it, builds a DataFrame, and then queries the latter with Spark SQL to produce analytics results. This is more than a single query against a data store. How would you drive this from the BI engine?
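To make the flow concrete, here is a minimal Spark (Scala) sketch of the pipeline described above, assuming Spark 2.x with a `SparkSession`; the HDFS path, column names and query are hypothetical, not from the original post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bi-flow-sketch")
  .getOrCreate()

// 1. Grab the data from a Hadoop store (here: a CSV flat file on HDFS;
//    the path and schema are illustrative assumptions)
val df = spark.read
  .option("header", "true")
  .csv("hdfs:///data/sales.csv")

// 2. Expose the DataFrame to Spark SQL as a temporary view
df.createOrReplaceTempView("sales")

// 3. Query it with Spark SQL to produce the analytics result
val result = spark.sql(
  "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
result.show()
```

A classic BI tool cannot express steps 1-2 through plain JDBC, which is exactly the gap the question is about.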
If you are talking about more than Spark SQL over a Hive context (i.e. data already stored in Hive), then your BI tool needs to be capable of creating RDDs, Datasets or DataFrames before running Spark SQL. This is how notebooks like Zeppelin, IPython or Jupyter work.
Spark SQL does provide a JDBC/ODBC interface via the Spark Thrift Server (STS), which ships as part of HDP. You can connect to STS from any BI client and issue plain SQL queries.
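As a sketch of that client side, STS speaks the HiveServer2 protocol, so a standard Hive JDBC connection works; the host, port, credentials and table below are assumptions (the default Thrift port is commonly 10000, but HDP deployments often change it), and the Hive JDBC driver (`org.apache.hive:hive-jdbc`) must be on the classpath:

```scala
import java.sql.DriverManager

// Hypothetical STS endpoint; adjust host/port/database to your cluster
val url = "jdbc:hive2://sts-host:10000/default"
val conn = DriverManager.getConnection(url, "hive", "")
try {
  val stmt = conn.createStatement()
  // Plain SQL, exactly as a BI engine would issue it over JDBC
  val rs = stmt.executeQuery(
    "SELECT region, SUM(amount) FROM sales GROUP BY region")
  while (rs.next()) {
    println(s"${rs.getString(1)} -> ${rs.getDouble(2)}")
  }
} finally {
  conn.close()
}
```

From the BI tool's point of view this looks just like talking to HiveServer2, which is why tools such as Tableau can point their Hive connector at STS.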