Member since
08-14-2018
3
Posts
0
Kudos Received
0
Solutions
10-26-2018
03:29 AM
I have checked the writer in the file's metadata, and it is Parquet.Net version 2.1.4.298. So it seems that this is not an Impala reader issue, but a Parquet.Net writer issue. The definition levels of NULLs in collections are wrong (according to Parquet spec). This issue it causes is that if the first column read is the collection with NULL in the row, then the 0 def level is interpreted as "the whole row is NULL". If there is another (non NULL) column read first, then its def will be used to determine parents's NULLness, so it will not be NULL. This is why adding 'id' leads to returning the expected results. I would not consider this a bug, rather an optimisation (checking every columns's def level could affect performance). Parquet.Net is not part of CDH and is not an Apachee project at the moment. I am not familiar with the project, so I do not know whether this is a known issue or not. My advice is to contact the maintainer mentioned at https://github.com/elastacloud/parquet-dotnet
... View more