Archives of Support Questions (Read Only)

roshand · ‎10-12-2017

Referring to this article https://hortonworks.com/blog/apache-hive-druid-part-1-3/ by @Carter Shanklin

We have a 12TB+ data set (growing ~8GB per day) of clickstream data (user events) in Hive.

The use case is to run OLAP queries across the data set, for now mostly GROUP BY.

How will this combination perform on a data set of this size?

Also, how production-ready is the combination?

sbouguerra · ‎10-12-2017

Hi @Roshan Dissanayake

The integration is production ready; we are planning GA in HDP 2.6.3, which is going to be released soon.

To answer your question about performance, I don't think the data size is an issue, since Druid/LLAP can scale horizontally. The real question is how much of your query can be pushed down to the Druid cluster. This might require rethinking the schema of the OLAP cubes and maybe rewriting some of the queries.

I will be happy to help you with that if you can share the queries and schema.
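
To make the push-down point concrete, here is a minimal Python sketch that builds the kind of simple GROUP BY aggregation that is a typical candidate for being handed to Druid, wrapped in EXPLAIN so the plan can be inspected in Hive. The table and column names (`clickstream_druid`, `country`, `clicks`) and the helper name are hypothetical, and whether a given query is actually pushed down depends on the Hive version and the query shape, so the plan output is what should be trusted.

```python
def build_groupby_explain(table: str, dimension: str, metric: str) -> str:
    """Return an EXPLAIN statement for a simple GROUP BY aggregation.

    All identifiers passed in are hypothetical placeholders; nothing here
    talks to Hive, it only assembles the HiveQL text.
    """
    query = (
        f"SELECT `{dimension}`, COUNT(*) AS events, SUM(`{metric}`) AS total\n"
        f"FROM `{table}`\n"
        f"GROUP BY `{dimension}`"
    )
    # EXPLAIN lets you check how much of the query Hive delegates to Druid.
    return "EXPLAIN\n" + query


if __name__ == "__main__":
    print(build_groupby_explain("clickstream_druid", "country", "clicks"))
```

Queries that join several large tables or use functions Druid cannot evaluate are the ones most likely to need the schema or query rewrite mentioned above.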

roshand · ‎10-13-2017

Thanks a lot for the quick reply.

Let me do a setup with a smaller data set and get back to you with some questions. 😄

roshand · ‎10-23-2017

Hi @Slim

Given that this dataset is already loaded into Hive, and the Hive table will be updated occasionally*, what are my chances of using Druid to index this data and Superset to visualise it (without replicating the data in Druid)? And would you recommend this approach?

*Will Druid automatically update its indexes when data is added to Hive?

sbouguerra · ‎10-23-2017

@Roshan Dissanayake

1- Keep in mind that the indexes carry some extra overhead, so technically speaking some part of the data will be replicated, but in the form of an index, which is compressed and more concise.

2- Hive will not manage the lifecycle of the Druid indexes. You need to set up Oozie (or any other workflow manager) to run the CREATE TABLE / INSERT INTO or DROP TABLE statements that keep the indexes up to date.

3- On a side note, I am not sure how the updates land in your Hive system, but if your pattern is mostly append/insert over a period of time, then Druid is designed for that use case, since the data will be partitioned using the time column.
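
As a rough illustration of point 2, the sketch below assembles the statements a scheduled workflow step might submit to Hive to rebuild a Druid-backed table from the source table. It assumes the Druid storage handler class `org.apache.hadoop.hive.druid.DruidStorageHandler` and the `druid.segment.granularity` table property described in the Hive/Druid integration documentation; the table names, column names and the helper itself are hypothetical, and a full drop-and-recreate is only one option (an append-only pattern could use INSERT INTO instead).

```python
from typing import List

# Storage handler class used by Hive for Druid-backed tables.
DRUID_HANDLER = "org.apache.hadoop.hive.druid.DruidStorageHandler"


def build_refresh_statements(
    source_table: str,
    druid_table: str,
    time_column: str,
    columns: List[str],
    segment_granularity: str = "DAY",
) -> List[str]:
    """Return [DROP, CREATE ... AS SELECT] statements for a full rebuild.

    The source timestamp is cast and aliased to `__time`, the column
    Druid uses to partition data by time.
    """
    select_list = ",\n  ".join(
        [f"CAST(`{time_column}` AS timestamp) AS `__time`"]
        + [f"`{c}`" for c in columns]
    )
    drop = f"DROP TABLE IF EXISTS `{druid_table}`"
    create = (
        f"CREATE TABLE `{druid_table}`\n"
        f"STORED BY '{DRUID_HANDLER}'\n"
        f'TBLPROPERTIES ("druid.segment.granularity" = "{segment_granularity}")\n'
        f"AS SELECT\n  {select_list}\n"
        f"FROM `{source_table}`"
    )
    return [drop, create]


if __name__ == "__main__":
    # Hypothetical names; a workflow manager would run these in order.
    for stmt in build_refresh_statements(
        "clickstream", "clickstream_druid", "event_time", ["country", "clicks"]
    ):
        print(stmt + ";\n")
```

A workflow manager such as Oozie would then run the generated statements on whatever schedule matches how often the Hive table changes.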

Cloudera Community

Druid Hive combination for 12TB+ Dataset (OLAP Usecase)