Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to generate a TopN query with a Hive table linked to a Druid datasource

avatar

Hi,

Following the Hive documentation (https://cwiki.apache.org/confluence/display/Hive/Druid+Integration) I linked a Hive table to an existing Druid datasource using the DruidStorageHandler. I managed to do select, timeseries and groupBy queries without any trouble but impossible to generate a TopN query. It always falls back on the groupBy.

My datasource (with a DAY granularity) schema is :

+------------+------------+--------------------+--+
|  col_name  | data_type  |      comment       |
+------------+------------+--------------------+--+
| __time     | timestamp  | from deserializer  |
| dimension1 | string     | from deserializer  |
| metric1    | bigint     | from deserializer  |
| dimension2 | string     | from deserializer  |
+------------+------------+--------------------+--+

The query I'm running is :

SELECT `dimension1`, `floor_day`(`__time`), sum(`metric1`) as s FROM my_db.my_table GROUP BY `dimension1`, `floor_day`(`__time`) ORDER BY s DESC LIMIT 10;

If I 'EXPLAIN' it, I have :

Plan optimized by CBO.                                                                               Stage-0
    Fetch Operator
      limit:-1
      Select Operator [SEL_1]
        Output:["_col0","_col1","_col2"]
        TableScan [TS_0]
	  Output:["dimension1","floor_day","$f2"],properties:{"druid.query.json":"{\"queryType\":\"groupBy\",\"dataSource\":\"my_datasource\",\"granularity\":\"all\",\"dimensions\":[{\"type\":\"default\",\"dimension\":\"dimension1\"},{\"type\":\"extraction\",\"dimension\":\"__time\",\"outputName\":\"floor_day\",\"extractionFn\":{\"type\":\"timeFormat\",\"format\":\"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'\",\"granularity\":\"day\",\"timeZone\":\"UTC\",\"locale\":\"en-US\"}}],\"limitSpec\":{\"type\":\"default\",\"limit\":10,\"columns\":[{\"dimension\":\"$f2\",\"direction\":\"descending\",\"dimensionOrder\":\"numeric\"}]},\"aggregations\":[{\"type\":\"longSum\",\"name\":\"$f2\",\"fieldName\":\"count\"}],\"intervals\":[\"1900-01-01T00:00:00.000/3000-01-01T00:00:00.000\"]}","druid.query.type":"groupBy"}                                                                                   

I'm using HDP-2.6.2.0-205 with Hive 2.1.0 and Druid 0.9.2.

Is there any configuration I'm missing ? Is my query badly written ? I tried to stick to the doc as much as possible.

Thank you

1 ACCEPTED SOLUTION

avatar
Expert Contributor
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login
3 REPLIES 3

avatar
Expert Contributor
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login

avatar

I'm impressed to have such a quick reply ! Thanks 🙂

Understood, it's normal that it's falling back for now. However, maybe you should update the Hive documentation to say that TopN queries are currently unsupported ?

Final thing : it's great you're planning to add it with an approximate flag. Maybe it's too far to know, but any idea of when could we expect this feature ?

avatar
Expert Contributor

Thanks yes we will update the Wiki.

Am not 100%, but I hope this will make it to 2.6.3, we are pushing it to the finish line.

Thanks.