Created on 10-18-2017 01:44 PM - edited 09-16-2022 05:25 AM
Hi,
Following the Hive documentation (https://cwiki.apache.org/confluence/display/Hive/Druid+Integration) I linked a Hive table to an existing Druid datasource using the DruidStorageHandler. I managed to do select, timeseries and groupBy queries without any trouble but impossible to generate a TopN query. It always falls back on the groupBy.
My datasource (with a DAY granularity) schema is :
+------------+------------+--------------------+--+ | col_name | data_type | comment | +------------+------------+--------------------+--+ | __time | timestamp | from deserializer | | dimension1 | string | from deserializer | | metric1 | bigint | from deserializer | | dimension2 | string | from deserializer | +------------+------------+--------------------+--+
The query I'm running is :
SELECT `dimension1`, `floor_day`(`__time`), sum(`metric1`) as s FROM my_db.my_table GROUP BY `dimension1`, `floor_day`(`__time`) ORDER BY s DESC LIMIT 10;
If I 'EXPLAIN' it, I have :
Plan optimized by CBO. Stage-0 Fetch Operator limit:-1 Select Operator [SEL_1] Output:["_col0","_col1","_col2"] TableScan [TS_0] Output:["dimension1","floor_day","$f2"],properties:{"druid.query.json":"{\"queryType\":\"groupBy\",\"dataSource\":\"my_datasource\",\"granularity\":\"all\",\"dimensions\":[{\"type\":\"default\",\"dimension\":\"dimension1\"},{\"type\":\"extraction\",\"dimension\":\"__time\",\"outputName\":\"floor_day\",\"extractionFn\":{\"type\":\"timeFormat\",\"format\":\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",\"granularity\":\"day\",\"timeZone\":\"UTC\",\"locale\":\"en-US\"}}],\"limitSpec\":{\"type\":\"default\",\"limit\":10,\"columns\":[{\"dimension\":\"$f2\",\"direction\":\"descending\",\"dimensionOrder\":\"numeric\"}]},\"aggregations\":[{\"type\":\"longSum\",\"name\":\"$f2\",\"fieldName\":\"count\"}],\"intervals\":[\"1900-01-01T00:00:00.000/3000-01-01T00:00:00.000\"]}","druid.query.type":"groupBy"}
I'm using HDP-2.6.2.0-205 with Hive 2.1.0 and Druid 0.9.2.
Is there any configuration I'm missing ? Is my query badly written ? I tried to stick to the doc as much as possible.
Thank you
Created 10-18-2017 01:53 PM
Sorry currently we only issue Time-series and Group-by queries and it is by design. The reason we backed of TopN is that Druid's TopN algorithm is an approximate thus is not the exact result, that is why we use Group-by all the time. Since we are a SQL shop, we need to make sure that results are correct rather than running fast. Although we are adding a new feature to allow approximate results thus the CBO will use TopN when possible if the approximate flag is turned On. Thanks!
Created 10-18-2017 01:53 PM
Sorry currently we only issue Time-series and Group-by queries and it is by design. The reason we backed of TopN is that Druid's TopN algorithm is an approximate thus is not the exact result, that is why we use Group-by all the time. Since we are a SQL shop, we need to make sure that results are correct rather than running fast. Although we are adding a new feature to allow approximate results thus the CBO will use TopN when possible if the approximate flag is turned On. Thanks!
Created 10-18-2017 02:03 PM
I'm impressed to have such a quick reply ! Thanks 🙂
Understood, it's normal that it's falling back for now. However, maybe you should update the Hive documentation to say that TopN queries are currently unsupported ?
Final thing : it's great you're planning to add it with an approximate flag. Maybe it's too far to know, but any idea of when could we expect this feature ?
Created 10-18-2017 02:11 PM
Thanks yes we will update the Wiki.
Am not 100%, but I hope this will make it to 2.6.3, we are pushing it to the finish line.
Thanks.