Hive on Tez, Impala, Presto, Drill or Phoenix?

Hello everyone, I'm Dave.

I want to query data from HBase (or HDFS, or a proper database) in response to webservice requests, so real-time processing is important to me. I have read a little, but I can't figure out which tool is the best/fastest for my case. I do not have huge amounts of data, but it is important that I can read it very fast and in parallel because of the webservice.

Is there an overview where the different tools are compared with each other, or is there a clear preference? It would be helpful to know which tool is the best one, but it would be even better to have a current benchmark or something similar as a solid basis for the decision.

I hope someone can help me get a better overview of this topic and pick the proper tool.

Thank you and best regards,

Dave

1 ACCEPTED SOLUTION

I think you are framing the question the wrong way around. You should give us your requirements (sub-1 s or sub-5 s queries, data volume per query, total data volume, number of requests per second or minute, read-only or not, daily batch updates or lots of writers) and we can give an answer based on that. Benchmark results change very much depending on the requirements. As far as I am concerned they are pretty much meaningless for a specific use case.

Hive

Hive is an analytical aggregation tool, and it is hard to get queries below 5 s. You can tune it for up to a hundred parallel queries, but it will not work for sub-second queries. However, we will soon have LLAP, a persistent process that should bring the minimum query time down significantly (to perhaps 1-2 s). The good thing is that above that minimum it scales to huge data volumes with very predictable behavior. So if you potentially need to aggregate billions of rows, this would be the way to go.

Phoenix/HBase

Phoenix/HBase is much better for sub-second queries that scan a small amount of data (up to millions of rows). In these cases you will also get higher concurrency, so it would be a good fit for a webservice. However, you need to be careful with the data modeling, and it will not scale as predictably as Hive. It is also the only tool I am aware of where you can do small updates as well.
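
To illustrate why the data modeling matters so much here: HBase keeps rows sorted by row key, so a query is fast when it can be answered by reading one contiguous key range. The following is only a toy sketch in plain Python, not the Phoenix or HBase API; the class `SortedKeyStore`, the `customer|date` key layout and the sample keys are all made up for illustration.

```python
import bisect


class SortedKeyStore:
    """Toy stand-in for a table whose rows are kept sorted by row key.

    Hypothetical helper for illustration only; not an HBase/Phoenix client.
    """

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        # Insert new keys at their sorted position; overwrite existing ones.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def prefix_scan(self, prefix):
        """Return (key, value) pairs whose key starts with `prefix`.

        Binary search finds the start of the range, then we read forward
        until the prefix no longer matches, so unrelated rows are never
        touched.
        """
        i = bisect.bisect_left(self._keys, prefix)
        result = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            result.append((key, self._rows[key]))
            i += 1
        return result
```

With keys such as `"customer42|2016-01-03"`, a request for one customer becomes `store.prefix_scan("customer42|")`, which reads only that customer's rows. If the key started with the date instead, the same request would have to look at every row, which is the kind of modeling mistake to watch out for.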

Presto

I don't know it too well, but as I understand it, one of its advantages is that you can run probabilistic queries that provide approximate results very, very quickly. Someone else might chime in.

Drill

As with Presto, I don't know too much about it. It seems to be quite impressive for a variety of use cases, but you don't have the wide support that Hive and Phoenix have.

Kylin

OLAP engine for VERY fast queries of aggregated data that can be precomputed into a cube.
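
The idea of precomputing into a cube can be sketched roughly like this: aggregate the measure once for every combination of dimensions, so that a query later is just a lookup. This is a minimal illustration in plain Python, not how Kylin is implemented or used; `build_cube`, `query` and the column names in the usage note are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dims, measure):
    """Precompute SUM(measure) for every subset of the dimensions.

    rows: list of dicts, dims: list of dimension column names,
    measure: name of the numeric column to sum.
    """
    dims = sorted(dims)
    cube = defaultdict(float)
    for row in rows:
        # One cell per dimension subset, including the empty subset
        # (the grand total).
        for size in range(len(dims) + 1):
            for group in combinations(dims, size):
                key = (group, tuple(row[d] for d in group))
                cube[key] += row[measure]
    return dict(cube)


def query(cube, **filters):
    """Answer SUM(measure) WHERE dim = value ... with a single lookup."""
    group = tuple(sorted(filters))
    key = (group, tuple(filters[d] for d in group))
    return cube.get(key, 0.0)
```

For example, after `cube = build_cube(rows, ["country", "year"], "sales")`, a call like `query(cube, country="DE")` does no scanning at all. The price is that the number of cells grows quickly with the number of dimensions, and only the aggregates chosen up front can be answered this way.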

Druid

Forgot one: Druid sounds really interesting. It is an OLAP engine based on inverted indexes, which might work great for aggregating large but not huge subsets of data.
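
To make the "requirements first" point concrete, the rules of thumb above could be written down roughly as follows. This is only a sketch of the reasoning in this answer, not a benchmark or a definitive selection procedure; the function `suggest_engine` and the threshold `SMALL_SCAN_ROWS` are assumptions made for illustration ("up to millions of rows" is read here as at most ten million).

```python
# Assumed cut-off for "a small amount of data (up to millions of rows)".
SMALL_SCAN_ROWS = 10_000_000


def suggest_engine(max_latency_s, rows_scanned,
                   needs_updates=False, precomputable=False):
    """Map rough requirements to the engine this answer would look at first."""
    if needs_updates:
        # The only option above that is described as handling small updates.
        return "Phoenix/HBase"
    if precomputable:
        # Aggregates that can be computed ahead of time into a cube.
        return "Kylin"
    if rows_scanned <= SMALL_SCAN_ROWS:
        # Small scans with sub-second latency and high concurrency.
        return "Phoenix/HBase"
    if max_latency_s >= 5:
        # Large aggregations where a few seconds per query is acceptable.
        return "Hive"
    return "no clear fit: revisit the requirements"
```

A webservice that needs sub-second answers over a few hundred thousand rows, `suggest_engine(0.5, 300_000)`, lands on Phoenix/HBase, while the same latency over billions of rows without any precomputation has no obvious match, which is exactly why the requirements need to be known first.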
