Created 07-24-2017 05:35 PM
Hi,
I am new to Spark and would like to ask a question about the use case I am working on.
The plan is to use Hadoop/Spark as a reporting solution: fetch data from an RDBMS (Oracle) source system, perform ETL, and run report jobs with Spark SQL.
The question is, can Spark also serve interactive report requests, for example a user requesting a report from a web application?
Would the new Spark Structured Streaming be helpful in my case?
Or should I load the ETL output into a structured DB for interactive reports?
Please suggest. Thanks in advance.
Created 07-25-2017 02:10 PM
Hi,
the best option is to load the ETL output into a structured DB for interactive reports. Spark has high latency compared to such databases. You might also try caching the tables to be queried with Spark SQL, but I would not recommend that option.
Spark Structured Streaming would be helpful for the ETL part. I guess you are going to stream data from your source RDBMS, apply some transformations, and write the result to a new RDBMS for reporting: that ETL application can be written with Spark Structured Streaming. However, at the moment Spark 2 is still not supported in the current HDP releases and Spark Structured Streaming is not mature yet. So, if you have to start your project now, I would suggest writing a simple Spark SQL application that you can run on Spark 1.6 and, later on, on Spark 2 (once it is supported) with very few changes.
Thanks,
Marco
Created 07-25-2017 09:16 PM
Thanks Marco.
As we do not want to go back to an RDBMS, could I use Vertica, HBase, or Presto as a structured/columnar store to hold the data after the ETL processing in Spark? Would there be any performance differences between these? Any suggestions?
Created 07-27-2017 03:35 PM
It depends on which queries you want to run against your data. If you have simple queries on the PKs, for instance, Phoenix+HBase might be the right choice. Presto and Vertica target analytical scans rather than low-latency point lookups, AFAIK. Thus, for interactive queries I'd still recommend an RDBMS.
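To illustrate the Phoenix+HBase option, here is a hedged sketch of a PK point lookup through Phoenix's Python driver (phoenixdb). The table name, key column, and Query Server URL are all assumptions, not anything from this thread:

```python
def pk_lookup_sql(table, key_col):
    """Build a parameterized point-lookup query; Phoenix can serve this
    as a single HBase GET when key_col is the primary key."""
    return "SELECT * FROM {} WHERE {} = ?".format(table, key_col)


# Against a real cluster the lookup would run roughly like this
# (phoenixdb driver; URL, table, and key are hypothetical):
#
# import phoenixdb
# conn = phoenixdb.connect("http://phoenix-qs:8765/", autocommit=True)
# cur = conn.cursor()
# cur.execute(pk_lookup_sql("REPORT_SUMMARY", "REPORT_ID"), ["r-42"])
# row = cur.fetchone()
```

Queries shaped like this stay fast on Phoenix+HBase; anything that scans or aggregates across large key ranges is where the RDBMS (or a true analytical engine) is the safer bet.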