
Looking for a low-latency query framework to expose streamed events: what's the best choice?

Hi,

We have the following use case:

We ingest 3.5 billion log transactions a day, which we need to process and expose to the reports in our front end.

The reports can be dynamic and can be built on any of the dimensions of the data.

Queries should return within a reasonable response time (2-3 seconds max).

The user can query the data (aggregates, top reports) going back up to 1 year.
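To put these requirements in numbers, here is a rough back-of-envelope calculation. It uses only the figures stated above (3.5 billion events a day, a 1-year query window, a 3-second response budget) and assumes a perfectly even ingest rate, which real traffic will not have.

```python
# Figures taken from the question.
EVENTS_PER_DAY = 3_500_000_000
RETENTION_DAYS = 365
RESPONSE_BUDGET_SECONDS = 3

SECONDS_PER_DAY = 24 * 60 * 60

# Average ingest rate, assuming traffic is spread evenly over the day.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

# Number of raw rows a query over the full 1-year window could touch.
events_per_year = EVENTS_PER_DAY * RETENTION_DAYS

# Scan rate needed to brute-force the full window within the budget.
rows_per_second_needed = events_per_year / RESPONSE_BUDGET_SECONDS

print(f"average ingest rate: {events_per_second:,.0f} events/s")
print(f"rows in a 1-year window: {events_per_year:,}")
print(f"scan rate for a full scan in 3 s: {rows_per_second_needed:,.0f} rows/s")
```

A full scan of the raw events would need to read several hundred billion rows per second, so whatever layer serves the reports will most likely have to answer from data that is already reduced or indexed.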

The data is persisted to HDFS.

We thought of doing it with Spark Structured Streaming, but Spark SQL gives poor performance at this scale without pre-aggregation (which is not dynamic).

The obvious choice is Vertica, an MS SQL columnar DB, or another similar solution, but they are all expensive.

I thought of ingesting the data with Spark and indexing it in another layer so that it gives us fast response times.
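As an illustration of what such a layer could do, here is a minimal sketch in plain Python (not Spark, and not the API of any product mentioned in this thread). It rolls raw events up into one row per time bucket and per combination of all dimensions, then answers group-by queries on any subset of those dimensions from the rolled-up rows. The event shape, the dimension names `country` and `status`, and the functions `rollup` and `query` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical event shape: (epoch_seconds, country, status, bytes_sent).
events = [
    (3600 * 0 + 5, "US", 200, 120),
    (3600 * 0 + 900, "US", 200, 80),
    (3600 * 0 + 1800, "DE", 500, 40),
    (3600 * 1 + 10, "US", 500, 60),
    (3600 * 1 + 20, "DE", 200, 100),
]

# Hypothetical dimension names, in the order they appear in the rollup key.
DIMENSIONS = ("country", "status")


def rollup(events, bucket_seconds=3600):
    """Collapse raw events into one row per (time bucket, all dimensions)."""
    cube = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, country, status, nbytes in events:
        key = (ts // bucket_seconds, country, status)
        cube[key]["count"] += 1
        cube[key]["bytes"] += nbytes
    return dict(cube)


def query(cube, group_by, metric="count", **filters):
    """Aggregate the rolled-up rows by any subset of the dimensions."""
    result = defaultdict(int)
    for (bucket, *dim_values), metrics in cube.items():
        row = dict(zip(DIMENSIONS, dim_values))
        # Skip rows that do not match every requested filter.
        if any(row[name] != value for name, value in filters.items()):
            continue
        result[tuple(row[name] for name in group_by)] += metrics[metric]
    return dict(result)


cube = rollup(events)
by_country = query(cube, group_by=("country",))
error_bytes_by_country = query(cube, group_by=("country",), metric="bytes", status=500)
print(by_country)
print(error_bytes_by_country)
```

Because every dimension stays in the rollup key, the grouping can still be chosen at query time. The trade-off is that the rolled-up data only shrinks as far as the number of distinct dimension combinations per time bucket allows.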

Is there any open source solution for that? I looked at the SnappyData example, but according to the benchmark they present against Spark, it doesn't seem to shorten the response time by that order of magnitude.

Please help, people...


Expert Contributor

Did you try ingesting directly into Hive LLAP?


Hi,

Did you look at Apache Kafka, the Apache Storm project, and the Kudu project?

They may help you handle the streaming faster.



Are there any available benchmarks of those frameworks in terms of data size and query response time?


Hi, you can check the example cases published by each of these projects on their websites. That may give you a better idea. Regards, Fahim
