Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Druid Hive combination for 12TB+ Dataset (OLAP Usecase)

New Member

Referring to this article https://hortonworks.com/blog/apache-hive-druid-part-1-3/ by @Carter Shanklin

We have a 12TB+ data set (growing by ~8GB per day) of clickstream data (user events) in Hive.

The use case is to run OLAP queries across the data set, for now mostly GROUP BY queries.

How will this combination perform on a data set of this size?

Also, how production-ready is the combination?

1 ACCEPTED SOLUTION

Expert Contributor

Hi @Roshan Dissanayake

The integration is production ready; we are planning a GA release in HDP 2.6.3, which is going to be released soon.

To answer your question about performance, I don't think the data size is an issue, since Druid and LLAP can scale horizontally. The real question is how much of your query can be pushed down to the Druid cluster. This might require rethinking the schema of the OLAP cubes and maybe rewriting some of the queries.
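As a rough illustration of what indexing a Hive table into Druid and querying it could look like, the sketch below builds two HiveQL strings in Python: a CTAS statement using the Druid storage handler described in the linked article, and a simple GROUP BY over the resulting table. The table and column names (clickstream, clickstream_druid, event_time, country, clicks) are hypothetical, and the granularity values are only examples.

```python
# Sketch only: table and column names are hypothetical placeholders.
DRUID_HANDLER = "org.apache.hadoop.hive.druid.DruidStorageHandler"


def build_druid_ctas(source_table, druid_table, time_column, dimensions, metrics):
    """Return a HiveQL CTAS that indexes source_table into a Druid-backed table."""
    # Druid-backed Hive tables expect a timestamp column named __time.
    select_cols = [f"CAST(`{time_column}` AS timestamp) AS `__time`"]
    # String columns become Druid dimensions; numeric columns become metrics.
    select_cols += [f"CAST(`{d}` AS string) AS `{d}`" for d in dimensions]
    select_cols += [f"`{m}`" for m in metrics]
    cols = ",\n  ".join(select_cols)
    return (
        f"CREATE TABLE {druid_table}\n"
        f"STORED BY '{DRUID_HANDLER}'\n"
        "TBLPROPERTIES (\n"
        '  "druid.segment.granularity" = "DAY",\n'
        '  "druid.query.granularity" = "HOUR"\n'
        ")\n"
        f"AS SELECT\n  {cols}\nFROM {source_table}"
    )


def build_groupby(druid_table, dimension, metric):
    """Return a single-dimension aggregation over the Druid-backed table."""
    return (
        f"SELECT `{dimension}`, SUM(`{metric}`) AS total_{metric}\n"
        f"FROM {druid_table}\n"
        f"GROUP BY `{dimension}`"
    )


if __name__ == "__main__":
    print(build_druid_ctas("clickstream", "clickstream_druid",
                           "event_time", ["country"], ["clicks"]))
    print(build_groupby("clickstream_druid", "country", "clicks"))
```

Whether a given query is actually pushed to Druid depends on the Hive version and the query shape, so it is worth checking the plan with EXPLAIN on a small sample first.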

I will be happy to help you with that if you can share the queries and schema.

New Member

Thanks a lot for the quick reply.

Let me do a setup with a smaller data set and get back to you with some questions. 😄

New Member

Hi @Slim

Given that this dataset is already loaded into Hive, and the Hive table will be updated occasionally*, what are my chances of using Druid to index this data and Superset to visualise it (without replicating the data in Druid)? And would you recommend this approach?

*Will Druid automatically update its indexes when data is added to Hive?

Expert Contributor

@Roshan Dissanayake

1- Keep in mind that the indexes carry some extra overhead, so technically speaking some part of the data will be replicated, but in the form of an index, which is compressed and more concise.

2- Hive will not manage the lifecycle of the Druid indexes. You need to set up Oozie (or any other workflow manager) to run the create table / insert into or drop table statements that keep the indexes up to date.

3- On a side note, I am not sure how the updates land in your Hive system, but if your pattern is mostly append/insert over a period of time, then Druid is designed for that use case, since the data is partitioned by the time column.
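A minimal sketch of the kind of statement a scheduled workflow step might generate for the append pattern in points 2 and 3: a Python helper that builds an "insert into" covering exactly one day of source rows. All table and column names are hypothetical, and whether INSERT INTO works against a Druid-backed table depends on the Hive version, so treat this as an assumption to verify on your cluster.

```python
from datetime import date, timedelta


def build_daily_refresh(source_table, druid_table, day, time_column, columns):
    """Return a HiveQL INSERT that appends one day of rows to druid_table.

    All names passed in are hypothetical; adjust them to the real schema.
    """
    # Half-open window [day 00:00:00, next day 00:00:00) so runs never overlap.
    start = f"{day.isoformat()} 00:00:00"
    end = f"{(day + timedelta(days=1)).isoformat()} 00:00:00"
    # The Druid-backed table expects the timestamp in a column named __time.
    select_cols = [f"CAST(`{time_column}` AS timestamp) AS `__time`"]
    select_cols += [f"`{c}`" for c in columns]
    cols = ",\n  ".join(select_cols)
    return (
        f"INSERT INTO TABLE {druid_table}\n"
        f"SELECT\n  {cols}\n"
        f"FROM {source_table}\n"
        f"WHERE `{time_column}` >= '{start}'\n"
        f"  AND `{time_column}` < '{end}'"
    )


if __name__ == "__main__":
    # A scheduler (for example an Oozie coordinator) would pass in the run date.
    print(build_daily_refresh("clickstream", "clickstream_druid",
                              date(2017, 10, 31), "event_time",
                              ["country", "clicks"]))
```

Generating one statement per day keeps each run aligned with a day-sized time partition, which matches the append-over-time pattern described above.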