Support Questions

pierre_gunet · ‎10-18-2017

Hi,

Following the Hive documentation (https://cwiki.apache.org/confluence/display/Hive/Druid+Integration) I linked a Hive table to an existing Druid datasource using the DruidStorageHandler. I managed to do select, timeseries and groupBy queries without any trouble but impossible to generate a TopN query. It always falls back on the groupBy.

My datasource (with a DAY granularity) schema is :

+------------+------------+--------------------+--+
|  col_name  | data_type  |      comment       |
+------------+------------+--------------------+--+
| __time     | timestamp  | from deserializer  |
| dimension1 | string     | from deserializer  |
| metric1    | bigint     | from deserializer  |
| dimension2 | string     | from deserializer  |
+------------+------------+--------------------+--+

The query I'm running is :

SELECT `dimension1`, `floor_day`(`__time`), sum(`metric1`) as s FROM my_db.my_table GROUP BY `dimension1`, `floor_day`(`__time`) ORDER BY s DESC LIMIT 10;

If I 'EXPLAIN' it, I have :

Plan optimized by CBO.                                                                               Stage-0
    Fetch Operator
      limit:-1
      Select Operator [SEL_1]
        Output:["_col0","_col1","_col2"]
        TableScan [TS_0]
	  Output:["dimension1","floor_day","$f2"],properties:{"druid.query.json":"{\"queryType\":\"groupBy\",\"dataSource\":\"my_datasource\",\"granularity\":\"all\",\"dimensions\":[{\"type\":\"default\",\"dimension\":\"dimension1\"},{\"type\":\"extraction\",\"dimension\":\"__time\",\"outputName\":\"floor_day\",\"extractionFn\":{\"type\":\"timeFormat\",\"format\":\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",\"granularity\":\"day\",\"timeZone\":\"UTC\",\"locale\":\"en-US\"}}],\"limitSpec\":{\"type\":\"default\",\"limit\":10,\"columns\":[{\"dimension\":\"$f2\",\"direction\":\"descending\",\"dimensionOrder\":\"numeric\"}]},\"aggregations\":[{\"type\":\"longSum\",\"name\":\"$f2\",\"fieldName\":\"count\"}],\"intervals\":[\"1900-01-01T00:00:00.000/3000-01-01T00:00:00.000\"]}","druid.query.type":"groupBy"}

I'm using HDP-2.6.2.0-205 with Hive 2.1.0 and Druid 0.9.2.

Is there any configuration I'm missing ? Is my query badly written ? I tried to stick to the doc as much as possible.

Thank you

sbouguerra · ‎10-18-2017

Sorry currently we only issue Time-series and Group-by queries and it is by design. The reason we backed of TopN is that Druid's TopN algorithm is an approximate thus is not the exact result, that is why we use Group-by all the time. Since we are a SQL shop, we need to make sure that results are correct rather than running fast. Although we are adding a new feature to allow approximate results thus the CBO will use TopN when possible if the approximate flag is turned On. Thanks!

View solution in original post

sbouguerra · ‎10-18-2017

Sorry currently we only issue Time-series and Group-by queries and it is by design. The reason we backed of TopN is that Druid's TopN algorithm is an approximate thus is not the exact result, that is why we use Group-by all the time. Since we are a SQL shop, we need to make sure that results are correct rather than running fast. Although we are adding a new feature to allow approximate results thus the CBO will use TopN when possible if the approximate flag is turned On. Thanks!

pierre_gunet · ‎10-18-2017

I'm impressed to have such a quick reply ! Thanks 🙂

Understood, it's normal that it's falling back for now. However, maybe you should update the Hive documentation to say that TopN queries are currently unsupported ?

Final thing : it's great you're planning to add it with an approximate flag. Maybe it's too far to know, but any idea of when could we expect this feature ?

sbouguerra · ‎10-18-2017

Thanks yes we will update the Wiki.

Am not 100%, but I hope this will make it to 2.6.3, we are pushing it to the finish line.

Thanks.

Cloudera Community

Support Questions

How to generate a TopN query with a Hive table linked to a Druid datasource