Created on 05-02-2017 11:10 AM - edited 09-16-2022 04:33 AM
Classification: //SecureWorks/Confidential - Limited External Distribution:
I have a table with a complex column which is an array of structs. I would like to run a query against the main table (not the complex fields) that has a LIMIT clause that limits to the top N rows. I would then like to join against the complex column. The results will be more than N rows. In the example below, I want to retrieve the 2 most valuable entries from the main table and also get back the full part list for each of those entries. Is this possible? Is there a way to do it efficiently (a single scan)?
In Hive:
hive> create external table main (id BIGINT, value BIGINT, summary STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> select * from main;
1 5 part 1
2 10 part 2
3 8 part 3
hive> create external table parts (id BIGINT, part_id BIGINT, count BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> select * from parts;
OK
1 11 5
1 13 6
2 11 4
2 15 9
2 7 4
3 11 1
hive> insert overwrite table example
> select main.id, value, summary, collect_set(named_struct('part_id',part_id,'count',count))
> from main
> join parts on main.id = parts.id
> group by main.id, value, summary;
hive> select * from example;
1 5 part 1 [{"part_id":11,"count":5},{"part_id":13,"count":6}]
2 10 part 2 [{"part_id":11,"count":4},{"part_id":15,"count":9},{"part_id":7,"count":4}]
3 8 part 3 [{"part_id":11,"count":1}]
In Impala: Show that the table is accessible.
Query: select * from example, example.parts
+----+-------+---------+---------+-------+
| id | value | summary | part_id | count |
+----+-------+---------+---------+-------+
| 1 | 5 | part 1 | 11 | 5 |
| 1 | 5 | part 1 | 13 | 6 |
| 2 | 10 | part 2 | 11 | 4 |
| 2 | 10 | part 2 | 15 | 9 |
| 2 | 10 | part 2 | 7 | 4 |
| 3 | 8 | part 3 | 11 | 1 |
+----+-------+---------+---------+-------+
In Impala: This query performs the limit after the join. It is not what I want. I want my query to return the ‘part 2’ and ‘part 3’ and all their subparts. That would be the last 4 rows in the query above.
Query: select * from example, example.parts order by value limit 2
+----+-------+---------+---------+-------+
| id | value | summary | part_id | count |
+----+-------+---------+---------+-------+
| 1 | 5 | part 1 | 13 | 6 |
| 1 | 5 | part 1 | 11 | 5 |
+----+-------+---------+---------+-------+
In Impala: Some failed attempts
[i-pvd1c1r2data01:21000] > select * from (select * from example ex order by value desc limit 2) sub, ex.parts;
Query: select * from (select * from example ex order by value desc limit 2) sub, ex.parts
ERROR: AnalysisException: Could not resolve table reference: 'ex.parts'
[i-pvd1c1r2data01:21000] > select * from (select * from example ex order by value desc limit 2) sub, sub.parts;
Query: select * from (select * from example ex order by value desc limit 2) sub, sub.parts
ERROR: AnalysisException: Could not resolve table reference: 'sub.parts'
Created 05-02-2017 10:12 PM
That's an interesting scenario. Unfotunately, there is no elegant way to do what you ask in Impala's SQL.
The query you really want to write is this one:
select * from (select *, parts from example ex order by value desc limit 2) v, v.parts
The issue is that Impala currently does not support returning complex types in the select-list of an inline view:
https://issues.apache.org/jira/browse/IMPALA-2777
Note that '*' in Impala expands to all scalar-typed colmns, so you need to list 'parts' explicitly.