
Can Apache Spark be used for interactive reporting?


Hi,

I am new to Spark and I would like to ask a question about a use case I am working on.

The plan is to use Hadoop/Spark as a reporting solution: fetch data from an RDBMS (Oracle) source system, perform ETL, and run report jobs with Spark SQL.

The question is: can Spark be used for interactive report requests as well, for example a user requesting a report from a web application?

Will the new Spark Structured Streaming be helpful for my use case?

Or should I load the ETL output into a structured DB for interactive reports?

Please advise. Thanks in advance.

1 ACCEPTED SOLUTION

Expert Contributor

Hi,

The best option is to load the ETL output into a structured DB for interactive reports; Spark has high latency compared to such databases. You could also try Spark SQL with the queried tables cached in memory, but I would not recommend that option.
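For completeness, here is a minimal sketch of the caching approach in Scala on Spark 1.6, assuming a hypothetical table named "sales" produced by your ETL:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("CachedReports"))
    val sqlContext = new HiveContext(sc)

    // "sales" is a placeholder table name; caching is lazy and takes
    // effect on the first action that touches the table
    sqlContext.cacheTable("sales")
    val report = sqlContext.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    report.show()

Even with the table cached, every query still goes through Spark's job scheduling, which is why the latency is usually too high for interactive reporting.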

Spark Structured Streaming could help you with the ETL. I assume you want to read the data from your source RDBMS as a stream, apply some transformations, and write the output to a new RDBMS for your reporting purposes: that ETL application can be written with Spark Structured Streaming. However, at the moment Spark 2 is still not supported in current HDP releases, and Spark Structured Streaming is not mature yet. So if you have to start your project now, I would suggest writing a simple Spark SQL application that you can run on Spark 1.6 today and, with very few changes, on Spark 2 once it is supported.
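As a starting point, here is a rough sketch of such a Spark SQL application on Spark 1.6, reading from Oracle over JDBC and writing to a reporting database. All connection URLs, credentials, and table/column names below are placeholders, and the relevant JDBC driver jars have to be on the classpath:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("OracleEtl"))
    val sqlContext = new SQLContext(sc)

    // extract: load a source table from Oracle over JDBC
    val orders = sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "SALES.ORDERS")
      .option("user", "etl_user")
      .option("password", "etl_password")
      .load()

    // transform: a trivial example step, keeping completed orders only
    val completed = orders.filter("STATUS = 'COMPLETED'")

    // load: write the result to the reporting database
    val props = new java.util.Properties()
    props.setProperty("user", "report_user")
    props.setProperty("password", "report_password")
    completed.write.mode("overwrite")
      .jdbc("jdbc:postgresql://reporthost:5432/reports", "COMPLETED_ORDERS", props)

The same DataFrame code runs on Spark 2 almost unchanged; you would build a SparkSession instead of a SQLContext.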

Thanks,

Marco


3 REPLIES



Thanks, Marco.

As we do not want to go back to an RDBMS, can I use Vertica, HBase, or Presto as a structured/columnar store to hold the data after the Spark ETL? Would there be any performance differences between them? Any suggestions?

Expert Contributor

It depends on which queries you want to run against your data. If you have simple queries on the primary keys, for instance, Phoenix+HBase might be the right choice. Presto and Vertica are not meant for interactive queries, AFAIK. Thus, I'd definitely recommend an RDBMS for interactive queries.
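If you do go the Phoenix+HBase route, a primary-key lookup is just a JDBC query. A minimal sketch in Scala, where the ZooKeeper quorum and the table/column names are placeholders and the Phoenix client jar is assumed to be on the classpath:

    import java.sql.DriverManager

    // Phoenix exposes HBase through a standard JDBC driver
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181:/hbase")
    val stmt = conn.prepareStatement(
      "SELECT ORDER_ID, AMOUNT FROM COMPLETED_ORDERS WHERE ORDER_ID = ?")
    stmt.setLong(1, 42L)   // point lookup on the primary key
    val rs = stmt.executeQuery()
    while (rs.next()) {
      println(s"order=${rs.getLong(1)} amount=${rs.getBigDecimal(2)}")
    }
    conn.close()

Because the WHERE clause hits the primary key, Phoenix can serve this as a single HBase point read, which is what makes this pattern fast enough for interactive use.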