Created 03-08-2017 09:14 PM
Hello:
I have a table in HAWQ that is stored as Parquet. The problem is that reading the Parquet files in Spark returns the timestamp columns as long. The table DDL is:
CREATE TABLE test ( location integer, year integer, week integer, startdateofweek timestamp without time zone, enddateofweek timestamp without time zone ) WITH (APPENDONLY=true, ORIENTATION=parquet, COMPRESSTYPE=snappy, OIDS=FALSE ) DISTRIBUTED RANDOMLY;
Note the columns startdateofweek and enddateofweek, which are created as timestamp type. Next, I read this table in Spark using the Parquet file location rather than connecting to HAWQ over a JDBC connection.
val path = "/hawq_default/16385/312220/479532/*"
val df = spark.read.parquet(path)
df.printSchema
df.show(2)
This returns the timestamp columns as long rather than timestamp.
|-- location: integer (nullable = true)
|-- year: integer (nullable = true)
|-- week: integer (nullable = true)
|-- startdateofweek: long (nullable = true)
|-- enddateofweek: long (nullable = true)
+--------+----+----+---------------+---------------+
|location|year|week|startdateofweek| enddateofweek|
+--------+----+----+---------------+---------------+
| 2290|2015| 19|484012800000000|484531200000000|
| 2290|2015| 19|484012800000000|484531200000000|
I have tried many different things to convert this long to a timestamp, but nothing works. Some of the things I have tried are:
1) Specifying the schema for the dataset using a case class. The columns are still returned as long.
case class myLog(location: Integer, year: Integer,week:Integer, startdateofweek:java.sql.Timestamp, enddateofweek:java.sql.Timestamp)
val path = "/hawq_default/16385/312220/479532/*"
val peopleDS = spark.read.parquet(path).as[myLog]
peopleDS.show(2)
2) Using the from_unixtime function. This returns bogus dates.
select from_unixtime(startdateofweek/1000) as ts from TEST where location = 2290 and week = 19 and year = 2015
17307-10-03 20:00:00 |
17307-10-03 20:00:00 |
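For scale, a quick check of the raw value 484012800000000 from the output above shows why the converted date lands around the year 17307 (a rough sketch, using the value shown earlier):
// from_unixtime(startdateofweek/1000) treats the result as seconds,
// so 484012800000000 / 1000 = 484012800000 "seconds" after 1970.
val raw = 484012800000000L
val interpretedSeconds = raw / 1000L
println(interpretedSeconds / (365.25 * 24 * 3600))  // ≈ 15338 years after 1970, i.e. roughly the year 17307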
Any help with this issue is much appreciated.
Thanks much
Created 03-08-2017 09:23 PM
This looks to be a precision issue. Postgres 8.2 (which HAWQ is loosely based on) stores timestamps with microsecond precision, whereas the from_unixtime() function expects the number in seconds, which explains why you're a few centuries in the future 🙂
Created 03-08-2017 09:32 PM
Thanks for the response. But converting from microseconds to seconds returns a date, just not the right one.
In Spark SQL:
select from_unixtime(startdateofweek/1000000) as ts from TEST where location = 2290 and week = 19 and year = 2015
Returns:
1985-05-03 20:00:00
1985-05-03 20:00:00
In Postgres, running the query:
select startdateofweek as ts from TEST_PARQUET where location = 2290 and week = 19 and year = 2015
Returns:
2015-05-04 00:00:00
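The two results differ by almost exactly 30 years (plus a timezone shift), which suggests the raw value counts microseconds from 2000-01-01 rather than from 1970-01-01. A quick sanity check, assuming that epoch:
// Check that 484012800000000 microseconds counted from 2000-01-01 UTC
// lands on the date HAWQ reports (2015-05-04 00:00:00).
import java.time.Instant
val epoch2000 = Instant.parse("2000-01-01T00:00:00Z").getEpochSecond    // 946684800
println(Instant.ofEpochSecond(epoch2000 + 484012800000000L / 1000000L)) // 2015-05-04T00:00:00Z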
Created 03-14-2017 03:25 AM
HAWQ follows the PostgreSQL 8.4 convention, in which 2000-01-01 acts as the zero value and timestamp data is stored as an 8-byte signed integer with microsecond precision.
From the PostgreSQL 8.4 documentation:
https://www.postgresql.org/docs/8.4/static/datatype-datetime.html
“When timestamp values are stored as eight-byte integers (currently the default), microsecond precision is available over the full range of values. ... timestamp values are stored as seconds before or after midnight 2000-01-01.”
So HAWQ timestamps have a 30-year offset relative to the Unix epoch (1970-01-01).
The example below converts startdateofweek in the DataFrame back to a timestamp:
df.select((($"startdateofweek" + 946684800000000L + 14400000000L)/1000000).cast(TimestampType)).show()
946684800000000L is the 30-year offset (1970-01-01 to 2000-01-01) in microseconds.
14400000000L is the 4-hour EST offset from GMT in microseconds.
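For completeness, here is a self-contained version of that conversion (a sketch based on the one-liner above; it uses col() instead of $ so no implicits import is needed, and the 4-hour term hard-codes EST with no daylight-saving handling):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

// 946684800000000L = microseconds from 1970-01-01 to 2000-01-01 (HAWQ's timestamp epoch)
// 14400000000L     = 4-hour EST-to-GMT offset in microseconds (hard-coded)
val fixed = df.withColumn(
  "startdateofweek",
  ((col("startdateofweek") + 946684800000000L + 14400000000L) / 1000000).cast(TimestampType)
)
fixed.show(2)
If the values should simply be treated as UTC, it may be cleaner to drop the 14400000000L term and set spark.sql.session.timeZone to UTC, but that depends on how the data was written.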
Created 03-14-2017 04:47 PM
Great catch @Tarun Tiwari!