Created 06-08-2016 08:42 AM
Hello everyone, I'm Dave.
I want to query some data from HBase (or HDFS or a proper database) in response to web service requests, so real-time processing is important to me. I have read a bit, but I can't figure out which tool is the best/fastest one for my case. I do not have huge data, but it is important that I can read it very quickly and in parallel because of the web service.
Is there an overview where the different tools are compared with each other, or is there a clear preference? It would be helpful to know which tool is the best one, but it would be even better to have a current benchmark or something similar as a solid basis for the decision.
I hope someone can help me get a better overview of this topic and find the right tool.
Thank you and best regards
Dave
Created 06-08-2016 09:42 AM
I think you are asking the question the wrong way round. You should give us your requirements (sub-1s or sub-5s queries, data volume queried, total data volume, number of requests per second/minute, read-only?, daily updates or lots of writers?) and we can give an answer based on that. Benchmark results change a lot depending on requirements. As far as I am concerned, they are pretty much meaningless for a specific use case.
Hive
Hive is an analytical aggregation tool and it is hard to get queries below 5s. You can tune it for up to a hundred parallel queries, but it will not work for sub-second queries. However, we will soon have LLAP, a persistent process that should bring down the minimum time significantly (to perhaps 1-2s). The good thing is that, after that, it scales to huge data volumes with very predictable behavior. So if you potentially need to aggregate billions of rows, this would be the way to go.
Phoenix/HBase
Phoenix/HBase is much better for sub-second queries that scan a small amount of data (up to millions of rows). In these cases you will also get higher concurrency, so it would be a good fit for a web service; however, you need to be careful with the data modeling, and it will not scale as predictably as Hive. It is also the only tool I am aware of where you could have small updates as well.
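To illustrate why the data modeling matters: HBase keeps rows sorted by row key, so a web service lookup is only fast if it maps to a key or a key prefix. The following is a rough sketch that simulates this with a sorted list in plain Python; it does not talk to HBase or Phoenix, and the key layout (`customer_id|day`) and the names `row_key` and `prefix_scan` are made up for the example.

```python
import bisect

def row_key(customer_id: str, day: str) -> bytes:
    # Hypothetical key layout: the field the web service filters on comes first.
    return f"{customer_id}|{day}".encode("utf-8")

# Stand-in for a table: HBase stores rows sorted by row key bytes.
rows = sorted(
    row_key(c, d)
    for c in ("c001", "c002", "c003")
    for d in ("2016-06-06", "2016-06-07", "2016-06-08")
)

def prefix_scan(sorted_keys, prefix: bytes):
    # Binary search for the contiguous block of keys that share the prefix.
    # b"\xff" never occurs in UTF-8 text, so it works as an upper bound here.
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + b"\xff")
    return sorted_keys[start:end]

# All rows for one customer form one short, contiguous range.
hits = prefix_scan(rows, b"c002|")
print(hits)
```

A query that filters only on `day` would not match a key prefix in this layout and would have to look at every row, which is the kind of access pattern to avoid for a latency-sensitive web service.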
Presto
I don't know it very well, but as I understand it, its main advantage is that you can run probabilistic queries that provide approximate results very quickly. Someone else might chime in.
Drill
As with Presto, I don't know too much about it. It seems to be quite impressive for a variety of use cases, but you don't have the wide support of Hive and Phoenix.
Kylin
OLAP engine for VERY fast queries of aggregated data that can be precomputed into a cube.
Druid
Forgot one: Druid sounds really interesting. It has an OLAP engine based on inverted indexes, which might work great for aggregating large but not huge subsets of data.
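The rules of thumb above could be written down roughly like this. It is only a sketch of the reasoning in this answer, not a benchmark: the `Requirements` fields, the function name `suggest_tool`, and the exact cut-offs (10 million rows standing in for "millions", 5 seconds for Hive) are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    max_latency_s: float          # latency target per query, in seconds
    rows_scanned: int             # rows a typical query has to touch
    needs_updates: bool = False   # small updates while serving reads
    precomputable: bool = False   # aggregates can be precomputed into a cube

# Assumed stand-in for "up to millions of rows"; not a hard limit of any tool.
SMALL_SCAN_ROWS = 10_000_000

def suggest_tool(req: Requirements) -> str:
    # Small updates: the only option named above is Phoenix/HBase.
    if req.needs_updates:
        return "Phoenix/HBase"
    # Sub-second queries over a small amount of data.
    if req.max_latency_s < 1 and req.rows_scanned <= SMALL_SCAN_ROWS:
        return "Phoenix/HBase"
    # Very fast queries on aggregates that can be precomputed into a cube.
    if req.precomputable:
        return "Kylin"
    # Large aggregations where roughly 5s or more per query is acceptable.
    if req.max_latency_s >= 5:
        return "Hive"
    return "unclear: test Hive with LLAP, Druid, Presto or Drill on your own workload"

print(suggest_tool(Requirements(max_latency_s=0.5, rows_scanned=100_000)))
```

For the web service described in the question (small data, fast parallel reads), this lands on Phoenix/HBase, but the real answer still depends on the requirements asked for at the top.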