We are generating parquet file using Python pandas library on a text file. The text file has a field value '2019-04-01 00:00:00.000', that is converted to format '2019-04-01 00:00:00+00:00 ' with data type 'datetime64[ns, UTC]'. The parquet file conversion is successful however while firing a select a query on the Hive external table on this specific column throws an error
Could you pelase guide what could be possible reason for it. We don't want the data type for this column to be STRING. As partial data will be sqoop from RDBMS and later will sent in Parquet format weekly/monthly/quarterly/yearly.
I can give a quick answer for Impala: reading int64 Parquet timestamps is implemented, but it is a quite new feature, released in CDH 6.2.
The more widely supported way to store timestamps in Parquet is INT96, so if Pandas can write it that way, then both Hive and Impala will be able to read it.
Note that there are more than one way to store a timestamp as int64 in Parquet (millisec vs microsec vs nanosec + utc vs local time). The way to interpret the int64 is stored in metadata. As far as I know, it is an ongoing work in Hive to support all possible formats.
If you know which int64 format is used, e.g. microseconds utc, then it is also possible to read it as BIGINT and convert it to timestamp in the query, or create a view that does this conversion.