Created on 10-12-2017 05:42 AM - edited 09-16-2022 05:23 AM
Referring to this article https://hortonworks.com/blog/apache-hive-druid-part-1-3/ by @Carter Shanklin
We have a 12 TB+ (growing ~8 GB per day) data set of clickstream data (user events) in Hive.
The use case is to run OLAP queries across the data set, for now mostly group-by aggregations.
How will this combination perform on a data set of this size?
Also, how production-ready is the combination?
Created 10-12-2017 05:47 PM
The integration is production ready; we are planning on GA in HDP 2.6.3, which is going to be released soon.
To answer your question about performance, I don't think the data size is an issue, since Druid/LLAP can scale horizontally. The real question is how much of your query can be pushed down to the Druid cluster. This might require rethinking the schema of the OLAP cubes and maybe rewriting some of the queries.
I will be happy to help you with that if you can share the queries and schema.
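For reference, the basic pattern from the linked blog post is to materialize the Hive data into a Druid-backed table with a `CREATE TABLE … AS SELECT`. A minimal sketch, assuming a hypothetical clickstream schema (`event_time`, `user_id`, `event_type`, `page_url` are illustrative names, not your actual columns):

```sql
-- Sketch of a Druid-backed Hive table (column and table names are assumptions).
-- The Druid storage handler requires a timestamp column named __time.
CREATE TABLE clickstream_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource" = "clickstream",
  "druid.segment.granularity" = "DAY",   -- one Druid segment per day of data
  "druid.query.granularity" = "HOUR"     -- finest time grain queries can roll up to
)
AS
SELECT
  CAST(event_time AS timestamp) AS `__time`,
  user_id,
  event_type,
  page_url
FROM clickstream_hive;
```

Group-by queries against `clickstream_druid` can then be pushed down to Druid where the Calcite rules allow it; the segment and query granularities above are just example values to tune for your data.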
Created 10-13-2017 06:42 AM
Thanks a lot for the quick reply.
Let me do a setup with a smaller data set and get back to you with some questions. 😄
Created 10-23-2017 05:10 AM
Hi @Slim
Given that this dataset is already loaded into Hive, and the Hive table will be updated occasionally*, what are my chances of using Druid to index this data and Superset to visualise it (without replicating the data in Druid)? And how would you recommend this approach?
*Will Druid automatically update its indexes when data is added to Hive?
Created 10-23-2017 01:23 PM
1- Keep in mind that the indexes have some extra overhead, so technically speaking part of the data will be replicated, but in the form of an index, thus compressed and more concise.
2- Hive will not manage the lifecycle of the Druid indexes; you need to set up Oozie (or another workflow manager) to run the CREATE TABLE / INSERT INTO statements (or DROP TABLE) that keep the indexes up to date.
3- On a side note, I'm not sure how the updates land in your Hive system, but if your pattern is mostly append/insert over a period of time, then Druid is designed for that use case, since data will be partitioned using a time column.
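The incremental step in point 2 could look like the following sketch, scheduled by Oozie (or cron). Table and column names are illustrative, and `${start_ts}`/`${end_ts}` stand for parameters the workflow would substitute for each run:

```sql
-- Sketch of a periodic incremental load into the Druid-backed table
-- (names and workflow parameters are assumptions, not a tested pipeline).
INSERT INTO TABLE clickstream_druid
SELECT
  CAST(event_time AS timestamp) AS `__time`,
  user_id,
  event_type,
  page_url
FROM clickstream_hive
WHERE event_time >= '${start_ts}'   -- window start, supplied by the scheduler
  AND event_time <  '${end_ts}';    -- window end, exclusive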