Created on 08-31-2018 05:11 PM - edited 09-16-2022 06:39 AM
I'm looking into loading existing Hive tables stored as Parquet into Druid, so I'm wondering about a few things:
Thanks!
Shahar
Created 08-31-2018 05:34 PM
Please take a look at this page HDP-2.6-Hive-Druid
- How much pre-processing is needed at Hive table creation? Should I "clean" the data so that no further aggregation happens in Druid, or will the granularity settings take care of aggregation on the Druid side? And if so, where should those aggregations be defined? (So if I want "HOUR" granularity, should I pre-process the table to group by hour and do all the aggregations within Hive?)
Use
"druid.query.granularity" = "HOUR"
Sorry, no complex metrics.
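For illustration, here is a minimal sketch of what that table creation could look like under the HDP 2.6 Hive-Druid integration referenced above. The source table and column names (source_parquet_table, event_time, country, clicks) are hypothetical:

```sql
-- Sketch only: names are made up, not from the original thread.
-- The first selected column must be a timestamp aliased as `__time`; with
-- "druid.query.granularity" = "HOUR", Druid rolls raw rows up to hourly grain at ingest.
CREATE TABLE druid_events
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "DAY",   -- how segments are partitioned
  "druid.query.granularity"   = "HOUR"   -- the rollup grain discussed above
)
AS
SELECT
  CAST(event_time AS timestamp) AS `__time`,
  country,   -- dimension
  clicks     -- metric, summed during rollup
FROM source_parquet_table;
```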
- One of my challenges is that new metrics are added on a weekly/monthly basis. How will I support that if I need to load the data into Druid daily? How would you handle schema evolution?
Use INSERT INTO for newer data and ALTER TABLE to add new columns.
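A rough sketch of that daily flow, following the advice above and reusing the hypothetical druid_events table; the new column, partition column, and date value are assumptions for illustration:

```sql
-- Hypothetical: a new metric column starts arriving this week, so extend the table first.
ALTER TABLE druid_events ADD COLUMNS (new_metric DOUBLE);

-- Daily incremental load appends only the newest day's rows.
INSERT INTO TABLE druid_events
SELECT
  CAST(event_time AS timestamp) AS `__time`,
  country,
  clicks,
  new_metric
FROM source_parquet_table
WHERE dt = '2018-08-31';   -- assumed daily partition column
```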
Created 08-31-2018 05:41 PM
Thanks @Slim!
I went over the doc, but I'm still not 100% sure about the relationship between the segment granularity and the output table. Again, should I be aggregating within the Hive query at all, or can I output "raw" data and let the granularity setting aggregate for me (and if so, will it support sum/count etc. out of the box)?
Created 08-31-2018 05:57 PM
If you are interested in rolling up SUM and COUNT, then you can output the raw data as-is and "druid.query.granularity"="HOUR" (FYI, not segment granularity) will do the rollup for you. If you want to compute other rollup metrics using MIN/MAX/AVG etc., then you need to do the rollup beforehand.
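To illustrate the second case, a sketch of pre-aggregating to hourly grain in Hive before the data reaches Druid, since MIN/MAX/AVG cannot be recovered later from Druid's automatic SUM/COUNT rollup. The raw_events and latency_ms names are made up, and druid_latency_hourly is assumed to be another Druid-backed table defined like the earlier sketch:

```sql
-- Bucket events to the start of their hour, then compute MIN/MAX/AVG per bucket in Hive.
INSERT INTO TABLE druid_latency_hourly
SELECT
  hour_start AS `__time`,
  country,
  MIN(latency_ms) AS min_latency,
  MAX(latency_ms) AS max_latency,
  AVG(latency_ms) AS avg_latency
FROM (
  SELECT
    CAST(from_unixtime(FLOOR(unix_timestamp(event_time) / 3600) * 3600) AS timestamp) AS hour_start,
    country,
    latency_ms
  FROM raw_events
) bucketed
GROUP BY hour_start, country;
```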
If you share an example of your use case, I can explain more.
Thanks.