Created on 07-12-2017 11:56 AM - edited 09-16-2022 04:55 AM
We have the following use case:
Ingestion of 3.5 billion log transactions a day, which we need to process and then expose to our front-end reports.
The reports can be dynamic, on any of the dimensions of the data.
Responses should come back in reasonable time (2-3 seconds max).
Users can query the data (aggregations, top-N reports) over a range of up to 1 year.
The data is persisted to HDFS.
We thought of doing it with Spark Structured Streaming, but Spark SQL gives poor performance at this scale without pre-aggregation (which is not dynamic).
The obvious choices are Vertica, MS SQL columnstore, or other similar solutions, but they are all expensive.
I thought of ingesting the data with Spark and indexing it in another layer that gives us fast response times.
Is there any open-source solution for that? I looked at the SnappyData example, but according to the benchmark they present vs. Spark, they don't seem to shorten response times by that magnitude.
Please help, people!
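To illustrate why pre-aggregation stops being "dynamic": answering arbitrary group-bys from pre-computed rollups means materializing a cube over every subset of dimensions, which grows as 2^d. A minimal pure-Python sketch (the dimension names are hypothetical, just for illustration):

```python
from itertools import combinations

# Hypothetical report dimensions for a log-transaction record.
dimensions = ["country", "device", "os", "browser", "app",
              "campaign", "status", "endpoint", "version", "carrier"]

# A full pre-aggregated cube needs one rollup per subset of dimensions,
# including the empty set (the grand total).
grouping_sets = sum(1 for r in range(len(dimensions) + 1)
                    for _ in combinations(dimensions, r))

print(grouping_sets)  # 2**10 = 1024 rollups for just 10 dimensions
```

With 20 dimensions that is over a million rollups, which is why pre-aggregation only works when the report shapes are known up front.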
Created 08-01-2017 04:23 PM
Did you try ingesting directly into Hive LLAP?
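For LLAP to perform well you would typically land the data in a partitioned, bucketed ORC table. A rough sketch of what that could look like (table, column, and bucket choices here are made up; tune them to your data):

```sql
-- Hypothetical layout: partition by day, bucket on a high-cardinality dimension.
CREATE TABLE log_transactions (
  event_time   TIMESTAMP,
  dimension_1  STRING,
  dimension_2  STRING,
  metric_value BIGINT
)
PARTITIONED BY (event_date DATE)
CLUSTERED BY (dimension_1) INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```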
Created 08-04-2017 01:15 PM
Have you looked at the Apache Kafka, Apache Storm, and Apache Kudu projects?
Together they may help you handle streaming ingestion faster.
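If you go the Kudu route, a common pattern is Kafka -> Spark/Storm -> Kudu, with the reports served through Impala. A hypothetical Kudu table defined via Impala might look like this (names and partition count are illustrative only):

```sql
-- Illustrative only; size the hash partitions for ~3.5B rows/day.
CREATE TABLE log_transactions (
  event_time   TIMESTAMP,
  txn_id       STRING,
  dimension_1  STRING,
  metric_value BIGINT,
  PRIMARY KEY (event_time, txn_id)
)
PARTITION BY HASH (txn_id) PARTITIONS 50
STORED AS KUDU;
```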
Created 08-06-2017 10:15 AM
Are there any published benchmarks for those frameworks in terms of data volume and query response time?
Created 08-06-2017 01:18 PM
Hi, you can check the example use cases published by each of these projects on their websites. That may give you a better idea. Regards, Fahim