Created on 07-26-2016 09:10 AM - edited 09-16-2022 03:31 AM
Hi,
how can I access the position of the element (in my case of the MAP) in the array column?
TABLE DDL:
CREATE TABLE mytable (
id int,
location ARRAY< MAP< string, string> > )
When I query the table with this SQL I got a position of the KEY-VALUE pair INSIDE the MAP, but what I really want to access is the position in the location ARRAY, e.g. 1st location, 2nd location and so on.
select c.id, a.pos, a.key, a.value from mytable c left join c.locations a order by c.id, a.pos;
Thanks
Created 07-29-2016 11:57 AM
Still investigating and working on the JIRA. In the meantimg, I think the query that you mean is this:
select t.id, l.pos as location_number, m.key, m.value from mytable t join t.locations l join l.item m order by id, l.pos;
The bug is that those queries returning wrong results are not semantically correct but Impala runs them anyway and gives "arbitrary" results.
Created 07-26-2016 01:57 PM
Hi Thomas,
in your example query the table alias 'a' has a 'pos' pseudo-column that refers to the element-position within the ARRAY, so I think it's indeed what you want.
However, I think there is a bug here that may lead to confusion.
The query you wrote "should" be illegal because you should not be able to access a.key and a.value without referencing the nested map in the FROM clause. I will follow-up with a JIRA, stay tuned.
Can you try running this and see if you get the expected results?
select c.id, a.pos from mytable c left join c.`location` a order by c.id, a.pos;
Alternatively try this query if you also want to explode the items within the nested MAP:
select c.id, a.pos, a.key, a.value from mytable c left join c.`location` a, a.item order by c.id, a.pos;
Created 07-27-2016 12:43 AM
Hi,
thanks for the response, yes it seems to be a bug. You can test it yourself, here are the scripts to reproduce:
(in Hive, "i" is just a dummy table):
create table mytable( id int, locations array< map<string,string>>) stored as parquet; insert into table mytable select 10001, array( map("Location1.City","Bratislava","Location1.Country","SK","Location1.LAT","41"), map("Location2.City","Kosice","Location2.Country","SK","Location2.LAT","42") ) from i limit 1; insert into table mytable select 10002, array( map("Location1.City","Wien","Location1.Country","AT","Location1.LAT","40"), map("Location2.City","Graz","Location2.Country","AT","Location2.LAT","40") ) from i limit 1;
Query results in Impala:
These results are wrond, what I need is a location_number -
0 for Location1.City
0 for Location1.Country
0 for Location1.LAT
1 for Location2.City
1 for Location2.Country
1 for Location2.LAT
... and so on for every ID in the table.
What you suggested (adding l.item), returned a wrong result set, because City in the result set is twice, but LAT is missing - this seems to be definitely a BUG:
Created 07-27-2016 01:34 AM
In Hive the solution is:
select id, ix, locations.key, locations.val from ( select id, ix, locs from mytable lateral view posexplode( locations ) x as ix,locs ) tab lateral view explode( locs ) locations as key,val;
which returns the correct result set:
Created 07-29-2016 02:19 AM
Created 07-29-2016 11:57 AM
Still investigating and working on the JIRA. In the meantimg, I think the query that you mean is this:
select t.id, l.pos as location_number, m.key, m.value from mytable t join t.locations l join l.item m order by id, l.pos;
The bug is that those queries returning wrong results are not semantically correct but Impala runs them anyway and gives "arbitrary" results.
Created 07-29-2016 12:09 PM
Filed the JIRA: https://issues.cloudera.org/browse/IMPALA-3938
Thanks a lot for your detailed report and easy reproduction!