Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Perform a join with an ARRAY column after a LIMIT is imposed.

avatar
Contributor

Classification: //SecureWorks/Confidential - Limited External Distribution:

 

I have a table with a complex column which is an array of structs.  I would like to run a query against the main table (not the complex fields) that has a LIMIT clause that limits to the top N rows.  I would then like to join against the complex column.  The results will be more than N rows.  In the example below, I want to retrieve the 2 most valuable entries from the main table and also get back the full part list for each of those entries.  Is this possible? Is there a way to do it efficiently (a single scan)? 

 

In Hive:

hive> create external table main (id BIGINT, value BIGINT, summary STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> select * from main;

1      5      part 1

2      10     part 2

3      8      part 3

 

hive> create external table parts (id BIGINT, part_id BIGINT, count BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> select * from parts;

OK

1      11     5

1      13     6

2      11     4

2      15     9

2      7      4

3      11     1

 

hive> insert overwrite table example

    > select main.id, value, summary, collect_set(named_struct('part_id',part_id,'count',count))

    > from main

    > join parts on main.id = parts.id

    > group by main.id, value, summary;

hive> select * from example;

1      5      part 1 [{"part_id":11,"count":5},{"part_id":13,"count":6}]

2      10     part 2 [{"part_id":11,"count":4},{"part_id":15,"count":9},{"part_id":7,"count":4}]

3      8      part 3 [{"part_id":11,"count":1}]

 

In Impala: Show that the table is accessible.

Query: select * from example, example.parts

+----+-------+---------+---------+-------+

| id | value | summary | part_id | count |

+----+-------+---------+---------+-------+

| 1  | 5     | part 1  | 11      | 5     |

| 1  | 5     | part 1  | 13      | 6     |

| 2  | 10    | part 2  | 11      | 4     |

| 2  | 10    | part 2  | 15      | 9     |

| 2  | 10    | part 2  | 7       | 4     |

| 3  | 8     | part 3  | 11      | 1     |

+----+-------+---------+---------+-------+

 

In Impala: This query performs the limit after the join.  It is not what I want.  I want my query to return the ‘part 2’ and ‘part 3’ and all their subparts.  That would be the last 4 rows in the query above.

Query: select * from example, example.parts order by value limit 2

+----+-------+---------+---------+-------+

| id | value | summary | part_id | count |

+----+-------+---------+---------+-------+

| 1  | 5     | part 1  | 13      | 6     |

| 1  | 5     | part 1  | 11      | 5     |

+----+-------+---------+---------+-------+

 

In Impala: Some failed attempts

[i-pvd1c1r2data01:21000] > select * from (select * from example ex order by value desc limit 2) sub, ex.parts;

Query: select * from (select * from example ex order by value desc limit 2) sub, ex.parts

ERROR: AnalysisException: Could not resolve table reference: 'ex.parts'

 

[i-pvd1c1r2data01:21000] > select * from (select * from example ex order by value desc limit 2) sub, sub.parts;

Query: select * from (select * from example ex order by value desc limit 2) sub, sub.parts

ERROR: AnalysisException: Could not resolve table reference: 'sub.parts'

 

 

1 REPLY 1

avatar

That's an interesting scenario. Unfotunately, there is no elegant way to do what you ask in Impala's SQL.

 

The query you really want to write is this one:

select * from (select *, parts from example ex order by value desc limit 2) v, v.parts

 

The issue is that Impala currently does not support returning complex types in the select-list of an inline view:

https://issues.apache.org/jira/browse/IMPALA-2777

 

Note that '*' in Impala expands to all scalar-typed colmns, so you need to list 'parts' explicitly.