Archives of Support Questions (Read Only)

This board is archived and read-only for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

Hawq Timestamp datatype in Spark

New Member

Hello:

I have a table in HAWQ which is stored as Parquet. The problem is that reading the Parquet files in Spark returns the timestamp columns as long. The table DDL is:

CREATE TABLE test (
    location integer,
    year integer,
    week integer,
    startdateofweek timestamp without time zone,
    enddateofweek timestamp without time zone
)
WITH (APPENDONLY=true, ORIENTATION=parquet, COMPRESSTYPE=snappy, OIDS=FALSE)
DISTRIBUTED RANDOMLY;

Note the columns startdateofweek and enddateofweek, which are created as timestamp. Next I read this table in Spark using the Parquet file location rather than connecting to HAWQ over JDBC.

val path = "/hawq_default/16385/312220/479532/*"

val df = spark.read.parquet(path)

df.printSchema

df.show(2)

This returns the timestamp columns as long rather than timestamp:

|-- location: integer (nullable = true)

|-- year: integer (nullable = true)

|-- week: integer (nullable = true)

|-- startdateofweek: long (nullable = true)

|-- enddateofweek: long (nullable = true)

+--------+----+----+---------------+---------------+
|location|year|week|startdateofweek|  enddateofweek|
+--------+----+----+---------------+---------------+
|    2290|2015|  19|484012800000000|484531200000000|
|    2290|2015|  19|484012800000000|484531200000000|
+--------+----+----+---------------+---------------+

I have tried many different things to convert this long to timestamp, but nothing works. Some of the things I have tried are:

1) Specifying the schema for the Dataset using a case class. It still returns long.

case class myLog(location: Integer, year: Integer,week:Integer, startdateofweek:java.sql.Timestamp, enddateofweek:java.sql.Timestamp)

val path = "/hawq_default/16385/312220/479532/*"

val peopleDS = spark.read.parquet(path).as[myLog]

peopleDS.show(2)

2) Using the from_unixtime function. This returns bogus dates:

select from_unixtime(startdateofweek/1000) as ts from TEST where location = 2290 and week = 19 and year = 2015

17307-10-03 20:00:00
17307-10-03 20:00:00

Any help in how to deal with this issue is much appreciated.

Thanks much

1 ACCEPTED SOLUTION

avatar
New Member

HAWQ follows the Postgres 8.4 convention, in which 2000-01-01 acts as the zero value and timestamp data is stored as an 8-byte signed integer with microsecond precision.

From the Postgres 8.4 documentation:

https://www.postgresql.org/docs/8.4/static/datatype-datetime.html

“When timestamp values are stored as eight-byte integers (currently the default), microsecond precision is available over the full range of values. [...] timestamp values are stored as seconds before or after midnight 2000-01-01.”

So HAWQ timestamps have a 30-year offset from the Unix epoch (1970-01-01).

The example below converts startdateofweek in the DataFrame back to a timestamp (TimestampType must be imported, and spark.implicits._ enables the $ column syntax):

import org.apache.spark.sql.types.TimestampType

import spark.implicits._

df.select((($"startdateofweek" + 946684800000000L + 14400000000L) / 1000000).cast(TimestampType)).show()

946684800000000L is the 30-year offset in microseconds.

14400000000L is the 4-hour offset of US Eastern daylight time (EDT) from GMT in microseconds.
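As a sanity check on those constants, here is a plain Scala sketch using java.time (no Spark needed; the sample value 484012800000000 is the startdateofweek shown earlier in the thread):

```scala
import java.time.{Duration, Instant, LocalDate, ZoneOffset}

object EpochOffsetCheck extends App {
  // Microseconds between the Unix epoch (1970-01-01) and the
  // Postgres/HAWQ timestamp epoch (2000-01-01).
  val unixEpoch = LocalDate.of(1970, 1, 1).atStartOfDay(ZoneOffset.UTC).toInstant
  val pgEpoch   = LocalDate.of(2000, 1, 1).atStartOfDay(ZoneOffset.UTC).toInstant
  val offsetMicros = Duration.between(unixEpoch, pgEpoch).getSeconds * 1000000L
  println(offsetMicros)   // 946684800000000

  // Apply the offset to the raw value read from the Parquet file.
  val raw = 484012800000000L
  val ts  = Instant.ofEpochSecond((raw + offsetMicros) / 1000000L)
  println(ts)             // 2015-05-04T00:00:00Z
}
```

Interpreted in UTC, the 30-year offset alone recovers the value Postgres reports (2015-05-04 00:00:00); the extra 14400000000L term is only needed to compensate for a session that renders timestamps in US Eastern time.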



4 REPLIES

New Member

This looks like a precision issue. Postgres 8.2 (which HAWQ is loosely based on) stores timestamps with microsecond precision, whereas the from_unixtime() function expects seconds; dividing by 1000 only gets you to milliseconds, which explains why you're a few centuries in the future 🙂
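A quick back-of-the-envelope check in plain Scala (the sample value is the one from the question) shows how treating milliseconds as seconds lands you around the year 17307:

```scala
object PrecisionCheck extends App {
  val raw   = 484012800000000L   // microseconds since 2000-01-01
  val wrong = raw / 1000L        // milliseconds, mistakenly treated as seconds
  // Rough conversion to years: 31557600 seconds per Julian year (365.25 days).
  val years = wrong / 31557600L
  println(1970 + years)          // 17307
}
```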

New Member

Thanks for the response, but converting from microseconds to seconds returns a date, just not the right one.

In Spark SQL:

select from_unixtime(startdateofweek/1000000) as ts from TEST where location = 2290 and week = 19 and year = 2015

Returns:

1985-05-03 20:00:00

1985-05-03 20:00:00

In Postgres, running the query:

select startdateofweek as ts from TEST_PARQUET where location = 2290 and week = 19 and year = 2015

Returns:

2015-05-04 00:00:00
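For what it's worth, the gap between the two results is telling (a quick java.time check; the dates are the two printed above): the Spark rendering trails the Postgres one by almost exactly 30 years plus a 4-hour time-zone shift, which points at an epoch offset rather than a precision problem.

```scala
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

object GapCheck extends App {
  val sparkResult    = LocalDateTime.of(1985, 5, 3, 20, 0)
  val postgresResult = LocalDateTime.of(2015, 5, 4, 0, 0)
  // Whole days between the two renderings of the same stored value:
  // 10957 days is 30 years to within a rounding error.
  println(ChronoUnit.DAYS.between(sparkResult, postgresResult))   // 10957
}
```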


New Member

Great catch @Tarun Tiwari!