Created on 08-31-2018 05:11 PM - edited 09-16-2022 06:39 AM
I'm looking into loading existing Hive tables stored as Parquet into Druid, so I'm wondering about a few things:
Thanks!
Shahar
Created 08-31-2018 05:34 PM
Please take a look at this page HDP-2.6-Hive-Druid
- How much pre-processing is needed at Hive table creation? Should I "clean" the data so that no further aggregation happens in Druid, or will the granularity settings take care of aggregation on the Druid side? And if so, where should those aggregations be defined? (So if I want "HOUR" granularity, should I pre-process the table to group by hour and do all the aggregations within Hive?)
Use
"druid.query.granularity" = "HOUR"
Sorry, no complex metrics.
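For illustration, here is a minimal sketch of what that table creation could look like under the HDP 2.6 Hive-Druid integration referenced above. The source table and column names (source_parquet_table, event_time, country, clicks) are hypothetical:

```sql
-- Sketch only: names are made up, not from the original thread.
-- The first selected column must be a timestamp aliased as `__time`; with
-- "druid.query.granularity" = "HOUR", Druid rolls raw rows up to hourly grain at ingest.
CREATE TABLE druid_events
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "DAY",   -- how segments are partitioned
  "druid.query.granularity"   = "HOUR"   -- the rollup grain discussed above
)
AS
SELECT
  CAST(event_time AS timestamp) AS `__time`,
  country,   -- dimension
  clicks     -- metric, summed during rollup
FROM source_parquet_table;
```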
- One of my challenges is that new metrics are added on a weekly/monthly basis. How will I support that if I need to load the data into Druid daily? How would you handle schema evolution?
Use INSERT INTO for newer data and ALTER TABLE to add new columns.
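A rough sketch of that daily flow, following the advice above and reusing the hypothetical druid_events table; the new column, partition column, and date value are assumptions for illustration:

```sql
-- Hypothetical: a new metric column starts arriving this week, so extend the table first.
ALTER TABLE druid_events ADD COLUMNS (new_metric DOUBLE);

-- Daily incremental load appends only the newest day's rows.
INSERT INTO TABLE druid_events
SELECT
  CAST(event_time AS timestamp) AS `__time`,
  country,
  clicks,
  new_metric
FROM source_parquet_table
WHERE dt = '2018-08-31';   -- assumed daily partition column
```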
Created 08-31-2018 05:41 PM
Thanks @Slim!
I went over the doc, but I'm still not 100% sure about the relationship between the segment granularity and the output table. Again, should I be aggregating within the Hive query at all, or can I output "raw" data and let the granularity setting aggregate for me (and if so, will it support sum/count etc. out of the box)?
Created 08-31-2018 05:57 PM
If you are interested in rolling up SUM and COUNT, then you can output the raw data as-is and "druid.query.granularity"="HOUR" (FYI, not segment granularity) will do the rollup for you. If you want to compute other rollup metrics using MIN/MAX/AVG etc., then you need to do the rollup beforehand.
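To illustrate the second case, a sketch of pre-aggregating to hourly grain in Hive before the data reaches Druid, since MIN/MAX/AVG cannot be recovered later from Druid's automatic SUM/COUNT rollup. The raw_events and latency_ms names are made up, and druid_latency_hourly is assumed to be another Druid-backed table defined like the earlier sketch:

```sql
-- Bucket events to the start of their hour, then compute MIN/MAX/AVG per bucket in Hive.
INSERT INTO TABLE druid_latency_hourly
SELECT
  hour_start AS `__time`,
  country,
  MIN(latency_ms) AS min_latency,
  MAX(latency_ms) AS max_latency,
  AVG(latency_ms) AS avg_latency
FROM (
  SELECT
    CAST(from_unixtime(FLOOR(unix_timestamp(event_time) / 3600) * 3600) AS timestamp) AS hour_start,
    country,
    latency_ms
  FROM raw_events
) bucketed
GROUP BY hour_start, country;
```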
If you share an example of your use case, I can explain more.
Thanks.