Druid Hive combination for 12TB+ Dataset (OLAP Usecase)

roshand — Fri, 16 Sep 2022 12:23:29 GMT

Referring to this article https://hortonworks.com/blog/apache-hive-druid-part-1-3/ by @Carter Shanklin

We have a 12TB+ data set (growing ~8GB per day) of clickstream data (user events) in Hive.

The use case is to run OLAP queries across the data set, for now mostly GROUP BY.

How will this combination perform for a data set of this size?

Also, how production ready is the combination?

Re: Druid Hive combination for 12TB+ Dataset (OLAP Usecase)

sbouguerra — Fri, 13 Oct 2017 00:47:22 GMT

Hi @Roshan Dissanayake

The integration is production ready; we are planning to make it GA in HDP 2.6.3, which is going to be released soon.

To answer your question about performance: I don't think the data size is an issue, since Druid/LLAP can scale horizontally. The real question is how much of your query can be pushed down to the Druid cluster. This might require rethinking the schema of the OLAP cubes and maybe rewriting some of the queries.

I will be happy to help you with that if you can share the queries and schema.

Re: Druid Hive combination for 12TB+ Dataset (OLAP Usecase)

roshand — Fri, 13 Oct 2017 13:42:54 GMT

Thanks a lot for the quick reply.

Let me do a setup with a smaller data set and get back to you with some questions. 😄

Re: Druid Hive combination for 12TB+ Dataset (OLAP Usecase)

roshand — Mon, 23 Oct 2017 12:10:06 GMT

Hi @Slim

Given that this dataset is already loaded into Hive, and the Hive table will be updated occasionally*, what are my chances of using Druid to index this data and Superset to visualise it (without replicating the data in Druid)? And would you recommend this approach?

*Will Druid automatically update its indexes when data is added to Hive?

Re: Druid Hive combination for 12TB+ Dataset (OLAP Usecase)

sbouguerra — Mon, 23 Oct 2017 20:23:01 GMT

@Roshan Dissanayake

1- Keep in mind that the indexes have some extra overhead, so technically speaking some part of the data will be replicated, but in the form of an index, and thus compressed and more concise.

2- Hive will not manage the lifecycle of the Druid indexes; you need to set up Oozie (or any other workflow manager) to run the create table / insert into or drop table statements that keep the indexes up to date.

3- On a side note, I am not sure how the updates land in your Hive system, but if your pattern is mostly append/insert over a period of time, then Druid is designed for that use case, since the data will be partitioned using the time column.
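To make point 2 a bit more concrete, here is a rough Python sketch of the kind of statements a scheduled job (Oozie or otherwise) might submit to Hive. It only builds the HiveQL text; it does not run anything. The table and column names in the usage note (`clicks`, `clicks_druid`, `event_time`, `country`, `clicks`) are hypothetical, the granularity values are just example settings, and the exact table properties supported may differ between Hive/HDP versions, so treat this as an illustration under those assumptions rather than a tested recipe.

```python
from datetime import date, timedelta

# Storage handler class used by the Hive/Druid integration.
STORAGE_HANDLER = "org.apache.hadoop.hive.druid.DruidStorageHandler"


def _select_list(time_col, dimensions, metrics):
    """Build the SELECT list; Druid expects the timestamp column to be named __time."""
    cols = [f"CAST(`{time_col}` AS timestamp) AS `__time`"]
    cols += [f"`{c}`" for c in list(dimensions) + list(metrics)]
    return ", ".join(cols)


def create_druid_table_sql(druid_table, source_table, time_col, dimensions, metrics,
                           segment_granularity="DAY", query_granularity="HOUR"):
    """Initial load: CREATE TABLE ... AS SELECT into a Druid-backed Hive table."""
    return (
        f"CREATE TABLE {druid_table}\n"
        f"STORED BY '{STORAGE_HANDLER}'\n"
        "TBLPROPERTIES (\n"
        f'  "druid.segment.granularity" = "{segment_granularity}",\n'
        f'  "druid.query.granularity" = "{query_granularity}"\n'
        ")\n"
        f"AS SELECT {_select_list(time_col, dimensions, metrics)}\n"
        f"FROM {source_table}"
    )


def append_day_sql(druid_table, source_table, time_col, dimensions, metrics, day):
    """Incremental load: append one day of rows (the append/insert pattern from point 3)."""
    next_day = day + timedelta(days=1)
    return (
        f"INSERT INTO TABLE {druid_table}\n"
        f"SELECT {_select_list(time_col, dimensions, metrics)}\n"
        f"FROM {source_table}\n"
        f"WHERE `{time_col}` >= '{day.isoformat()}' "
        f"AND `{time_col}` < '{next_day.isoformat()}'"
    )


def drop_druid_table_sql(druid_table):
    """Tear down the Druid-backed table, e.g. before a full rebuild."""
    return f"DROP TABLE IF EXISTS {druid_table}"
```

For example, `append_day_sql("clicks_druid", "clicks", "event_time", ["country"], ["clicks"], date(2017, 10, 23))` produces an INSERT covering only 2017-10-23, which a daily workflow could submit through beeline or a Hive action.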