- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Perform a join with an ARRAY column after a LIMIT is imposed.
- Labels:
-
Apache Impala
Created on ‎05-02-2017 11:10 AM - edited ‎09-16-2022 04:33 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Classification: //SecureWorks/Confidential - Limited External Distribution:
I have a table with a complex column which is an array of structs. I would like to run a query against the main table (not the complex fields) that has a LIMIT clause that limits to the top N rows. I would then like to join against the complex column. The results will be more than N rows. In the example below, I want to retrieve the 2 most valuable entries from the main table and also get back the full part list for each of those entries. Is this possible? Is there a way to do it efficiently (a single scan)?
In Hive:
hive> create external table main (id BIGINT, value BIGINT, summary STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> select * from main;
1 5 part 1
2 10 part 2
3 8 part 3
hive> create external table parts (id BIGINT, part_id BIGINT, count BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> select * from parts;
OK
1 11 5
1 13 6
2 11 4
2 15 9
2 7 4
3 11 1
hive> insert overwrite table example
> select main.id, value, summary, collect_set(named_struct('part_id',part_id,'count',count))
> from main
> join parts on main.id = parts.id
> group by main.id, value, summary;
hive> select * from example;
1 5 part 1 [{"part_id":11,"count":5},{"part_id":13,"count":6}]
2 10 part 2 [{"part_id":11,"count":4},{"part_id":15,"count":9},{"part_id":7,"count":4}]
3 8 part 3 [{"part_id":11,"count":1}]
In Impala: Show that the table is accessible.
Query: select * from example, example.parts
+----+-------+---------+---------+-------+
| id | value | summary | part_id | count |
+----+-------+---------+---------+-------+
| 1 | 5 | part 1 | 11 | 5 |
| 1 | 5 | part 1 | 13 | 6 |
| 2 | 10 | part 2 | 11 | 4 |
| 2 | 10 | part 2 | 15 | 9 |
| 2 | 10 | part 2 | 7 | 4 |
| 3 | 8 | part 3 | 11 | 1 |
+----+-------+---------+---------+-------+
In Impala: This query performs the limit after the join. It is not what I want. I want my query to return the ‘part 2’ and ‘part 3’ and all their subparts. That would be the last 4 rows in the query above.
Query: select * from example, example.parts order by value limit 2
+----+-------+---------+---------+-------+
| id | value | summary | part_id | count |
+----+-------+---------+---------+-------+
| 1 | 5 | part 1 | 13 | 6 |
| 1 | 5 | part 1 | 11 | 5 |
+----+-------+---------+---------+-------+
In Impala: Some failed attempts
[i-pvd1c1r2data01:21000] > select * from (select * from example ex order by value desc limit 2) sub, ex.parts;
Query: select * from (select * from example ex order by value desc limit 2) sub, ex.parts
ERROR: AnalysisException: Could not resolve table reference: 'ex.parts'
[i-pvd1c1r2data01:21000] > select * from (select * from example ex order by value desc limit 2) sub, sub.parts;
Query: select * from (select * from example ex order by value desc limit 2) sub, sub.parts
ERROR: AnalysisException: Could not resolve table reference: 'sub.parts'
Created ‎05-02-2017 10:12 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That's an interesting scenario. Unfotunately, there is no elegant way to do what you ask in Impala's SQL.
The query you really want to write is this one:
select * from (select *, parts from example ex order by value desc limit 2) v, v.parts
The issue is that Impala currently does not support returning complex types in the select-list of an inline view:
https://issues.apache.org/jira/browse/IMPALA-2777
Note that '*' in Impala expands to all scalar-typed colmns, so you need to list 'parts' explicitly.
