05-02-2019
09:17 AM
Hi! I can give a quick answer for Impala: reading int64 Parquet timestamps is implemented, but it is quite a new feature, released in CDH 6.2. The more widely supported way to store timestamps in Parquet is INT96, so if Pandas can write them that way, then both Hive and Impala will be able to read them.

Note that there is more than one way to store a timestamp as int64 in Parquet (milliseconds vs. microseconds vs. nanoseconds, and UTC vs. local time). The way to interpret the int64 is stored in the metadata. As far as I know, work is ongoing in Hive to support all possible formats.

If you know which int64 format is used, e.g. microseconds UTC, then it is also possible to read the column as BIGINT and convert it to a timestamp in the query, or to create a view that does this conversion.

Regards, Csaba
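As a sketch of that last suggestion, assuming the values are microseconds since the Unix epoch in UTC, and using a hypothetical table `events` with the int64 column read as BIGINT in a column named `ts_us`, a view could do the conversion like this (untested, names are illustrative only):

```sql
-- Hypothetical example: convert an int64-microseconds-UTC column to TIMESTAMP.
-- In Impala, '/' performs floating-point division, so sub-second precision
-- survives the cast from (fractional) seconds to TIMESTAMP.
CREATE VIEW events_with_ts AS
SELECT id,
       CAST(ts_us / 1000000 AS TIMESTAMP) AS ts
FROM events;
```

If the values were milliseconds instead, the divisor would be 1000; checking the Parquet metadata first tells you which unit applies.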
10-26-2018
03:29 AM
I have checked the writer in the file's metadata, and it is Parquet.Net version 2.1.4.298. So it seems that this is not an Impala reader issue, but a Parquet.Net writer issue: the definition levels of NULLs in collections are wrong (according to the Parquet spec).

The issue this causes is that if the first column read is the collection with a NULL in the row, then the 0 definition level is interpreted as "the whole row is NULL". If another (non-NULL) column is read first, then its definition level is used to determine the parent's NULLness, so the row will not be NULL. This is why adding 'id' leads to the expected results being returned. I would not consider this an Impala bug, but rather an optimisation (checking every column's definition level could affect performance).

Parquet.Net is not part of CDH and is not an Apache project at the moment. I am not familiar with the project, so I do not know whether this is a known issue or not. My advice is to contact the maintainer mentioned at https://github.com/elastacloud/parquet-dotnet
10-25-2018
08:18 AM
I created such a table using Hive, but I was unable to reproduce the issue:

CREATE TABLE test_users_flat (
  id INT,
  name STRING,
  device_id STRING,
  device_model STRING
);

INSERT INTO test_users_flat VALUES
  (1, "user A", "device A1", "phone A1"),
  (1, "user A", "device A2", NULL),
  (2, "user B", "device B1", NULL),
  (2, "user B", "device B2", "phone B2"),
  (2, "user B", "device B3", "phone B3");

INSERT INTO test_users
SELECT id, name,
       collect_list(named_struct('id', device_id,
                                 'device_info', named_struct('model', device_model)))
FROM test_users_flat
GROUP BY id, name;

The resulting table could be read with Impala without problems:

SELECT u.name, d.device_info.model AS model
FROM test_users u, u.devices d;

Result:

1  user A  phone A1
2  user A  NULL
3  user B  NULL
4  user B  phone B2
5  user B  phone B3

Can you provide more information, like the CDH version and the exact steps used to fill the table?
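For completeness, the nested target table `test_users` is not shown in the post above; a hypothetical DDL matching the queries (the column and struct field names are inferred from the `collect_list`/`named_struct` expressions, not taken from the original) could look like this:

```sql
-- Sketch of the nested table assumed by the example above.
CREATE TABLE test_users (
  id INT,
  name STRING,
  devices ARRAY<STRUCT<id: STRING,
                       device_info: STRUCT<model: STRING>>>
) STORED AS PARQUET;
```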
10-18-2018
12:30 PM
Can you give more information about the way it was created? If we could reproduce it, then a ticket could be created for the writer component.
10-17-2018
04:47 AM
The attached Parquet file seems to be corrupt: the definition levels of nested NULL values are filled incorrectly. What tool was used to create the file? For more information about the problem with the file, see https://issues.apache.org/jira/browse/IMPALA-7471?focusedCommentId=16650568&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16650568