Created 07-24-2017 05:35 PM
Hi,
I am new to Spark and would like to ask a question about the use case I am working on.
The plan is to use Hadoop/Spark as a reporting solution: fetch data from an RDBMS (Oracle) source system, perform ETL, and run report jobs with Spark SQL.
The question is, can Spark also serve interactive report requests, for example a user requesting a report from a web application?
Would the new Spark Structured Streaming be helpful in my case?
Or should I load the ETL output into a structured DB for interactive reports?
Please suggest. Thanks in advance.
Created 07-25-2017 02:10 PM
Hi,
the best option is to load the ETL output into a structured DB for interactive reports. Spark has high latency compared to such databases. You might also try caching the tables to be queried with Spark SQL, but I would not recommend that option.
Spark Structured Streaming would be helpful for the ETL part. I guess you are going to stream data from your source RDBMS, apply some transformations, and write the result to a new RDBMS for reporting: that ETL application can be written with Spark Structured Streaming. However, at the moment Spark 2 is still not supported in the current HDP releases and Spark Structured Streaming is not mature yet. So, if you have to start your project now, I would suggest writing a simple Spark SQL application that you can run on Spark 1.6 and, later on, on Spark 2 (once it is supported) with very few changes.
Thanks,
Marco
Created 07-25-2017 09:16 PM
Thanks Marco.
As we do not want to go back to an RDBMS, could I use Vertica, HBase, or Presto as a structured/columnar store to hold the data after the ETL processing in Spark? Would there be any performance differences between these? Any suggestions?
Created 07-27-2017 03:35 PM
It depends on which queries you want to run against your data. If you have simple queries on the PKs, for instance, Phoenix+HBase might be the right choice. Presto and Vertica target analytical scans rather than low-latency point lookups, AFAIK. Thus, for interactive queries I'd still recommend an RDBMS.
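To illustrate the Phoenix+HBase option, here is a hedged sketch of a PK point lookup through Phoenix's Python driver (phoenixdb). The table name, key column, and Query Server URL are all assumptions, not anything from this thread:

```python
def pk_lookup_sql(table, key_col):
    """Build a parameterized point-lookup query; Phoenix can serve this
    as a single HBase GET when key_col is the primary key."""
    return "SELECT * FROM {} WHERE {} = ?".format(table, key_col)


# Against a real cluster the lookup would run roughly like this
# (phoenixdb driver; URL, table, and key are hypothetical):
#
# import phoenixdb
# conn = phoenixdb.connect("http://phoenix-qs:8765/", autocommit=True)
# cur = conn.cursor()
# cur.execute(pk_lookup_sql("REPORT_SUMMARY", "REPORT_ID"), ["r-42"])
# row = cur.fetchone()
```

Queries shaped like this stay fast on Phoenix+HBase; anything that scans or aggregates across large key ranges is where the RDBMS (or a true analytical engine) is the safer bet.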