Reply
Explorer
Posts: 8
Registered: ‎10-16-2014

Perform a join with an ARRAY column after a LIMIT is imposed.

Classification: //SecureWorks/Confidential - Limited External Distribution:

 

I have a table with a complex column which is an array of structs.  I would like to run a query against the main table (not the complex fields) that has a LIMIT clause that limits to the top N rows.  I would then like to join against the complex column.  The results will be more than N rows.  In the example below, I want to retrieve the 2 most valuable entries from the main table and also get back the full part list for each of those entries.  Is this possible? Is there a way to do it efficiently (a single scan)? 

 

In Hive:

hive> create external table main (id BIGINT, value BIGINT, summary STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> select * from main;

1      5      part 1

2      10     part 2

3      8      part 3

 

hive> create external table parts (id BIGINT, part_id BIGINT, count BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> select * from parts;

OK

1      11     5

1      13     6

2      11     4

2      15     9

2      7      4

3      11     1

 

hive> insert overwrite table example

    > select main.id, value, summary, collect_set(named_struct('part_id',part_id,'count',count))

    > from main

    > join parts on main.id = parts.id

    > group by main.id, value, summary;

hive> select * from example;

1      5      part 1 [{"part_id":11,"count":5},{"part_id":13,"count":6}]

2      10     part 2 [{"part_id":11,"count":4},{"part_id":15,"count":9},{"part_id":7,"count":4}]

3      8      part 3 [{"part_id":11,"count":1}]

 

In Impala: Show that the table is accessible.

Query: select * from example, example.parts

+----+-------+---------+---------+-------+

| id | value | summary | part_id | count |

+----+-------+---------+---------+-------+

| 1  | 5     | part 1  | 11      | 5     |

| 1  | 5     | part 1  | 13      | 6     |

| 2  | 10    | part 2  | 11      | 4     |

| 2  | 10    | part 2  | 15      | 9     |

| 2  | 10    | part 2  | 7       | 4     |

| 3  | 8     | part 3  | 11      | 1     |

+----+-------+---------+---------+-------+

 

In Impala: This query performs the limit after the join.  It is not what I want.  I want my query to return the ‘part 2’ and ‘part 3’ and all their subparts.  That would be the last 4 rows in the query above.

Query: select * from example, example.parts order by value limit 2

+----+-------+---------+---------+-------+

| id | value | summary | part_id | count |

+----+-------+---------+---------+-------+

| 1  | 5     | part 1  | 13      | 6     |

| 1  | 5     | part 1  | 11      | 5     |

+----+-------+---------+---------+-------+

 

In Impala: Some failed attempts

[i-pvd1c1r2data01:21000] > select * from (select * from example ex order by value desc limit 2) sub, ex.parts;

Query: select * from (select * from example ex order by value desc limit 2) sub, ex.parts

ERROR: AnalysisException: Could not resolve table reference: 'ex.parts'

 

[i-pvd1c1r2data01:21000] > select * from (select * from example ex order by value desc limit 2) sub, sub.parts;

Query: select * from (select * from example ex order by value desc limit 2) sub, sub.parts

ERROR: AnalysisException: Could not resolve table reference: 'sub.parts'

 

 

Highlighted
Cloudera Employee
Posts: 251
Registered: ‎10-16-2013

Re: Perform a join with an ARRAY column after a LIMIT is imposed.

That's an interesting scenario. Unfotunately, there is no elegant way to do what you ask in Impala's SQL.

 

The query you really want to write is this one:

select * from (select *, parts from example ex order by value desc limit 2) v, v.parts

 

The issue is that Impala currently does not support returning complex types in the select-list of an inline view:

https://issues.apache.org/jira/browse/IMPALA-2777

 

Note that '*' in Impala expands to all scalar-typed colmns, so you need to list 'parts' explicitly.

Announcements