Impala bug with nested arrays of structures where some of the fields contain NULL

New Contributor
Hi All,
 
We found a case where Impala returns incorrect values from a simple query. Our data contains a nested array of structures, and those structures contain other structures.
We generated minimal sample data that reproduces the issue.
 
SQL to create a table:
CREATE TABLE plat_test.test_users (
  id INT,
  name STRING,   
  devices ARRAY<
    STRUCT<
      id:STRING,
      device_info:STRUCT<
        model:STRING
      >
    >
  >
)
STORED AS PARQUET;
 
Please put the attached Parquet file in the table's location and refresh the table.
The sample data has 2 users, one with 2 devices and the other with 3. Some of the devices.device_info.model fields are NULL.
 
When I issue a query:
SELECT u.name, d.device_info.model as model
FROM test_users u,
u.devices d;
 
I expect to get 5 records in the result, but I get only one:
[Screenshot: 1.png]
If I change query to:
SELECT u.name, d.device_info.model as model
FROM test_users u
LEFT OUTER JOIN u.devices d;
 
I get two records in the result, which is still not what it should be:
[Screenshot: 2.png]
We found a workaround for this problem. If we add d.id to the result columns, we get all records from the Parquet file:
SELECT u.name, d.id, d.device_info.model as model
FROM test_users u
, u.devices d;
 
The result is:
[Screenshot: 3.png]
 
But we can't rely on this workaround: we don't need d.id in every query, and Impala optimizes it away, so we get unpredictable results.
 
I tested Hive query on this table and it returns expected results:
SELECT u.name, d.device_info.model
FROM test_users u
lateral view outer inline (u.devices) d;
 
Results:
[Screenshot: 4.png]
Please advise whether this is a problem in the Impala engine or a mistake in our query.
 
 
Best regards,
Come2Play team.
7 REPLIES

Hi @Yurii thanks for the bug report - we'll look into it. What version of Impala did you see this on?

 

One thing worth trying is changing the PARQUET_ARRAY_RESOLUTION query option to THREE_LEVEL:

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_array_resolution.html

 

We have sometimes seen similar symptoms with parquet nested types because of some ambiguity in encodings, e.g. https://issues.apache.org/jira/browse/IMPALA-4725. The original version of Impala's nested types defaulted to detecting an older representation of arrays first and we had to keep that behaviour in Impala 2.x/CDH5.x for backwards compatibility. In Impala 3.0/CDH6.0 we're defaulting to the standard array resolution method.

Cloudera Employee

The attached Parquet file seems corrupt: the definition levels of nested NULL values are filled in incorrectly.

 

What tool was used to create the file?

 

For more information about the problem with file see https://issues.apache.org/jira/browse/IMPALA-7471?focusedCommentId=16650568&page=com.atlassian.jira....

New Contributor
We got the Parquet file from HDFS, where it was created by CDH.

Cloudera Employee

Can you give more information about the way it was created?

 

If we could reproduce it, then a ticket could be created for the writer component.

New Contributor
We generated minimal sample data that reproduces the issue.

SQL to create a table:
CREATE TABLE plat_test.test_users (
  id INT,
  name STRING,
  devices ARRAY<
    STRUCT<
      id:STRING,
      device_info:STRUCT<
        model:STRING
      >
    >
  >
)
STORED AS PARQUET;

Insert some records and download the file via the HUE interface.

Cloudera Employee

I created such a table using Hive, but I was unable to reproduce the issue:

CREATE TABLE test_users_flat (
id INT,
name STRING,
device_id STRING,
device_model STRING
);

 

insert into test_users_flat values
(1, "user A", "device A1", "phone A1"),
(1, "user A", "device A2", "NULL"),
(2, "user B", "device B1", "NULL"),
(2, "user B", "device B2", "phone B2"),
(2, "user B", "device B3", "phone B3");

 

INSERT INTO test_users
SELECT id, name, collect_list(named_struct('id', device_id, 'device_info', named_struct('model', device_model)))
FROM test_users_flat GROUP BY id, name;

 

The resulting table could be read with Impala without problem:

 

SELECT u.name, d.device_info.model as model
FROM test_users u,
u.devices d;

 

result:

name    model
user A  phone A1
user A  NULL
user B  NULL
user B  phone B2
user B  phone B3

 

Can you provide more information like CDH version and the exact steps used to fill the table?

Cloudera Employee

I have checked the writer in the file's metadata, and it is Parquet.Net version 2.1.4.298.

 

So it seems that this is not an Impala reader issue, but a Parquet.Net writer issue. The definition levels of NULLs in collections are wrong (according to the Parquet spec).
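
To make the definition-level point concrete, here is a minimal sketch of how the Parquet spec derives the level for the devices.device_info.model column. It assumes the file uses the standard three-level list layout with every field optional; that schema, and the names MODEL_PATH and definition_level, are illustrative assumptions rather than anything read from the attached file.

```python
# Path from the table root to the leaf column, with each node's repetition type.
# ASSUMPTION: standard three-level LIST encoding and all fields optional.
MODEL_PATH = [
    ("devices", "optional"),
    ("list", "repeated"),
    ("element", "optional"),
    ("device_info", "optional"),
    ("model", "optional"),
]


def definition_level(path, defined_depth):
    """Count optional/repeated nodes among the first `defined_depth` path nodes.

    `defined_depth` is how many nodes, starting from the root, are actually
    present (non-NULL, or non-empty for the repeated node). Required fields
    would not add to the level; none are assumed in this schema.
    """
    return sum(1 for _, rep in path[:defined_depth] if rep in ("optional", "repeated"))


max_level = definition_level(MODEL_PATH, len(MODEL_PATH))       # model has a value
null_model = definition_level(MODEL_PATH, len(MODEL_PATH) - 1)  # device_info present, model NULL
null_devices = definition_level(MODEL_PATH, 0)                  # the devices array itself is NULL

print(max_level, null_model, null_devices)  # 5 4 0
```

Under these assumptions a NULL model inside an existing device should carry level 4. A level of 0 says that the devices array itself is NULL, which is how a file that writes 0 for every NULL leaf gets misread.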

 

The issue this causes is that if the first column read from the collection holds a NULL in a row, then its def level of 0 is interpreted as "the whole row is NULL". If another (non-NULL) column is read first, then its def level is used to determine the parent's NULLness, so the parent will not be treated as NULL. This is why adding 'id' leads to the expected results. I would not consider this a bug, but rather an optimisation (checking every column's def level could affect performance).
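
The behaviour described above can be illustrated with a toy model. This is not Impala's code: element_exists and ELEMENT_DEF_LEVEL are hypothetical names, and the level values assume a three-level list schema in which every field is optional.

```python
# Toy model of the explanation above; NOT Impala's implementation.
# ASSUMPTION: definition level 3 is the point at which a `devices` element
# is known to exist (devices=1, list=2, element=3).
ELEMENT_DEF_LEVEL = 3


def element_exists(columns_read, def_levels):
    """Decide whether an array element exists from the first column read only."""
    first = columns_read[0]
    return def_levels[first] >= ELEMENT_DEF_LEVEL


# One device whose model is NULL, as a spec-compliant writer would encode it.
correct = {"id": 4, "model": 4}
# The same device as the faulty file encodes it: the NULL leaf gets level 0.
faulty = {"id": 4, "model": 0}

print(element_exists(["model"], correct))       # True
print(element_exists(["model"], faulty))        # False: the element is dropped
print(element_exists(["id", "model"], faulty))  # True: 'id' is read first
```

In this sketch the element disappears only when the wrongly encoded column is the first one consulted, which matches the observation that adding d.id to the select list brings the missing rows back.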

 

Parquet.Net is not part of CDH and is not an Apache project at the moment. I am not familiar with the project, so I do not know whether this is a known issue or not. My advice is to contact the maintainer mentioned at https://github.com/elastacloud/parquet-dotnet