
Hive to Druid Methodology

Explorer

I'm looking into loading existing Hive tables stored as Parquet into Druid, so I'm wondering about a few things:

  • I considered doing this directly from Druid, without Hive, but Druid does not seem to support nested Parquet objects. Has anyone run into the same issue?
  • How much pre-processing is needed when creating the Hive table? Should I "clean" the data so that no further aggregation happens in Druid, or will the granularity settings take care of aggregation on the Druid side? And if so, where should those aggregations be defined? (For example, if I want "HOUR" granularity, should I pre-process the table to group by hour and do all the aggregations in Hive?)
  • Is there any support for "HyperUnique" in this workflow? I'm looking to do something like "unique user IDs".
  • One of my challenges is that new metrics are added on a weekly/monthly basis. How can I support that if I need to load the data into Druid daily? How would you handle schema evolution?
  • I haven't found documentation for all the different Hive-side Druid configuration properties (such as "hive.druid.broker.address.default"). Would you mind pointing me to it?

Thanks!

Shahar

1 ACCEPTED SOLUTION

Expert Contributor

Please take a look at this page: HDP-2.6-Hive-Druid.

  • How much pre-processing is needed when creating the Hive table? Should I "clean" the data so that no further aggregation happens in Druid, or will the granularity settings take care of aggregation on the Druid side? And if so, where should those aggregations be defined? (For example, if I want "HOUR" granularity, should I pre-process the table to group by hour and do all the aggregations in Hive?)

Set the table property

"druid.query.granularity" = "HOUR"
  • Is there any support for "HyperUnique" in this workflow? I'm looking to do something like "unique user IDs".

Sorry, no. Complex metrics such as HyperUnique are not supported in this workflow.

  • One of my challenges is that new metrics are added on a weekly/monthly basis. How can I support that if I need to load the data into Druid daily? How would you handle schema evolution?

Use INSERT INTO for newer data and ALTER TABLE to add new columns.
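
For reference, here is a minimal end-to-end sketch of that workflow (the table, column, and datasource names are hypothetical; the storage handler class and table properties are the ones used by the HDP 2.6 Hive-Druid integration):

-- Create a Druid-backed table from an existing Parquet Hive table.
-- The Druid storage handler expects the timestamp column to be named `__time`.
CREATE TABLE druid_events
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.datasource"          = "events",  -- name of the Druid datasource to create
  "druid.segment.granularity" = "DAY",     -- how Druid partitions the segments
  "druid.query.granularity"   = "HOUR"     -- rollup grain applied at ingestion
)
AS
SELECT
  CAST(event_time AS timestamp) AS `__time`,
  user_id,
  country,
  clicks
FROM parquet_events;

-- Daily loads: append only the new data (dt is a hypothetical partition column).
INSERT INTO TABLE druid_events
SELECT CAST(event_time AS timestamp) AS `__time`, user_id, country, clicks
FROM parquet_events
WHERE dt = '2017-09-01';

-- Schema evolution: add a new metric column when it starts arriving.
ALTER TABLE druid_events ADD COLUMNS (new_metric BIGINT);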


3 REPLIES


Explorer

Thanks @Slim!

I went over the doc, but I'm still not 100% sure about the relationship between the segment granularity and the output table. Should I be aggregating within the Hive query at all, or can I output "raw" data and let the granularity setting aggregate for me (and if so, will it support SUM/COUNT etc. out of the box)?

Expert Contributor

If you are interested in rolling up SUM and COUNT, then you can output the raw data as is and "druid.query.granularity" = "HOUR" (note: this is the query granularity, not the segment granularity) will do the rollup for you. If you want to compute other rollup metrics such as MIN/MAX/AVG, then you need to do the rollup beforehand in Hive.
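
To illustrate the second case, here is a minimal sketch of pre-aggregating in Hive before writing to a Druid-backed table (table and column names are hypothetical): the inner query truncates each event to the start of its hour, and the outer query computes MIN/MAX/AVG per hourly bucket, so there is nothing left for Druid's rollup to collapse.

-- Pre-compute MIN/MAX/AVG per hour in Hive, since Druid's rollup only covers SUM/COUNT here.
CREATE TABLE druid_hourly_latency
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "DAY",
  "druid.query.granularity"   = "HOUR"
)
AS
SELECT
  hour_start AS `__time`,
  country,
  MIN(latency_ms) AS min_latency,
  MAX(latency_ms) AS max_latency,
  AVG(latency_ms) AS avg_latency
FROM (
  SELECT
    -- truncate each event timestamp to the start of its hour
    CAST(from_unixtime(floor(unix_timestamp(event_time) / 3600) * 3600) AS timestamp) AS hour_start,
    country,
    latency_ms
  FROM raw_events
) hourly
GROUP BY hour_start, country;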

If you share an example of your use case, I can help explain more.

Thanks.