We have the following use case:
We ingest 3.5 billion log transactions a day, which we need to process and expose to our front end as reports.
The reports can be dynamic and can be built on any dimension of the data.
Queries should return within a reasonable response time (2-3 seconds at most).
Users can query the data (aggregated reports, top reports) over a range of up to 1 year.
The data is persisted to HDFS.
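For a sense of scale, here is a rough back-of-envelope calculation based only on the figures above. It assumes ingestion is spread evenly over the day, which real log traffic usually is not, so peak rates will be higher than the average shown.

```python
# Back-of-envelope sizing from the numbers stated in the question.
# Assumption: events arrive uniformly over the day (real peaks will be higher).
EVENTS_PER_DAY = 3_500_000_000
SECONDS_PER_DAY = 24 * 60 * 60
RETENTION_DAYS = 365  # queries may span up to 1 year

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
events_retained = EVENTS_PER_DAY * RETENTION_DAYS

print(f"average ingest rate: {events_per_second:,.0f} events/s")
print(f"rows queryable over one year: {events_retained:,}")
```

So a one-year query over raw data would have to cover more than a trillion rows, which is why scanning raw events within 2-3 seconds is the hard part.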
We thought of doing it with Spark Structured Streaming, but Spark SQL gives poor performance at this scale without pre-aggregation (which is not dynamic).
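To make the pre-aggregation trade-off concrete, here is a minimal sketch in plain Python (not Spark). It rolls raw events up per time bucket while keeping every dimension in the key, so a report can still group by any subset of dimensions at query time. The field names (`ts`, `country`, `device`, `bytes`) and the helper names `rollup` and `query` are hypothetical, chosen only for illustration. It also assumes additive measures such as counts and sums; distinct counts cannot be re-aggregated this way, and how much the data shrinks depends entirely on the cardinality of the dimensions.

```python
from collections import defaultdict

def rollup(events, dimensions, bucket_seconds=3600):
    """Pre-aggregate raw events per (time bucket, full dimension tuple)."""
    cube = defaultdict(lambda: {"count": 0, "bytes": 0})
    for e in events:
        bucket = e["ts"] - e["ts"] % bucket_seconds  # start of the time bucket
        key = (bucket,) + tuple(e[d] for d in dimensions)
        cube[key]["count"] += 1
        cube[key]["bytes"] += e["bytes"]
    return dict(cube)

def query(cube, dimensions, group_by):
    """Re-aggregate rolled-up cells on any subset of dimensions at query time."""
    idx = [dimensions.index(d) + 1 for d in group_by]  # +1 skips the time bucket
    out = defaultdict(lambda: {"count": 0, "bytes": 0})
    for key, measures in cube.items():
        group = tuple(key[i] for i in idx)
        out[group]["count"] += measures["count"]
        out[group]["bytes"] += measures["bytes"]
    return dict(out)

# Hypothetical sample data: four raw events, two dimensions.
DIMENSIONS = ["country", "device"]
events = [
    {"ts": 10, "country": "US", "device": "mobile", "bytes": 100},
    {"ts": 20, "country": "US", "device": "mobile", "bytes": 50},
    {"ts": 30, "country": "DE", "device": "desktop", "bytes": 70},
    {"ts": 4000, "country": "US", "device": "desktop", "bytes": 30},
]
cube = rollup(events, DIMENSIONS)          # four events collapse into three cells
by_country = query(cube, DIMENSIONS, ["country"])
print(by_country)
```

The report dimension is picked at query time rather than at ingestion time, which is the sense in which this kind of rollup stays dynamic. Whether it is fast enough at the scale described would still need to be measured.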
The obvious choice is Vertica, an MS SQL columnar DB, or another similar solution, but they are all expensive.
I thought of ingesting the data with Spark and indexing it in another layer that gives us a fast response time.
Is there any open source solution for that? I looked at the SnappyData example, but according to the benchmark they present against Spark, it does not seem to shorten the response time by that order of magnitude.
Any help would be appreciated.
Did you check Apache Kafka, Apache Storm, and Apache Kudu?
They may help you handle the streaming faster.
Are there any available benchmarks for those frameworks in terms of data size and query response time?
Hi, you can check the example use cases that each of these projects publishes on its website. That may give you a better idea. Regards, Fahim