<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hawq Timestamp datatype in Spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147128#M56633</link>
    <description>&lt;P&gt;Great catch &lt;A rel="user" href="https://community.cloudera.com/users/16599/taruntiwari.html" nodeid="16599"&gt;@Tarun Tiwari&lt;/A&gt;!&lt;/P&gt;</description>
    <pubDate>Tue, 14 Mar 2017 23:47:04 GMT</pubDate>
    <dc:creator>kdunn926</dc:creator>
    <dc:date>2017-03-14T23:47:04Z</dc:date>
    <item>
      <title>Hawq Timestamp datatype in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147124#M56629</link>
      <description>&lt;P&gt;Hello:&lt;/P&gt;&lt;P&gt;I have a table in Hawq that is stored as Parquet. The problem is that reading the Parquet file in Spark returns the timestamp columns as long. The table DDL is:&lt;/P&gt;&lt;P&gt;CREATE TABLE test ( location integer, year integer, week integer, &lt;STRONG&gt;startdateofweek timestamp&lt;/STRONG&gt; without time zone, &lt;STRONG&gt;enddateofweek timestamp&lt;/STRONG&gt; without time zone ) WITH (APPENDONLY=true, ORIENTATION=parquet, COMPRESSTYPE=snappy, OIDS=FALSE ) DISTRIBUTED RANDOMLY;&lt;/P&gt;&lt;P&gt;Note the columns startdateofweek and enddateofweek, which are created as timestamp type. Next, I read this table in Spark using the Parquet file location rather than connecting to Hawq over JDBC:&lt;/P&gt;&lt;P&gt;val path = "/hawq_default/16385/312220/479532/*"&lt;/P&gt;&lt;P&gt;val df = spark.read.parquet(path)&lt;/P&gt;&lt;P&gt;df.printSchema&lt;/P&gt;&lt;P&gt;df.show(2)&lt;/P&gt;&lt;P&gt;This returns the timestamp columns as long rather than timestamp:&lt;/P&gt;&lt;P&gt;|-- location: integer (nullable = true) &lt;/P&gt;&lt;P&gt;|-- year: integer (nullable = true) &lt;/P&gt;&lt;P&gt;|-- week: integer (nullable = true) &lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;|-- startdateofweek: long (nullable = true) &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;|-- enddateofweek: long (nullable = true) &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;+--------+----+----+---------------+---------------+&lt;/P&gt;&lt;P&gt;|location|year|week|startdateofweek| enddateofweek|&lt;/P&gt;&lt;P&gt;+--------+----+----+---------------+---------------+ &lt;/P&gt;&lt;P&gt;| 2290|2015| 19|&lt;STRONG&gt;484012800000000|484531200000000&lt;/STRONG&gt;|&lt;/P&gt;&lt;P&gt;| 2290|2015| 19|484012800000000|484531200000000| &lt;/P&gt;&lt;P&gt;I have tried many different things to convert this long to timestamp, but nothing works. Some of the things I have tried are:&lt;/P&gt;&lt;P&gt;1) Specifying the schema for the dataset using a case class. The columns are still returned as long.&lt;/P&gt;&lt;P&gt;case class myLog(location: Integer, year: Integer,week:Integer, startdateofweek:java.sql.Timestamp, enddateofweek:java.sql.Timestamp)&lt;/P&gt;&lt;P&gt;val path = "/hawq_default/16385/312220/479532/*"&lt;/P&gt;&lt;P&gt;val peopleDS = spark.read.parquet(path).as[myLog]&lt;/P&gt;&lt;P&gt;peopleDS.show(2)&lt;/P&gt;&lt;P&gt;2) Using the from_unixtime function. This returns bogus dates.&lt;/P&gt;&lt;P&gt;select from_unixtime(startdateofweek/1000) as ts from TEST where location = 2290 and week = 19 and year = 2015&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;17307-10-03 20:00:00&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;17307-10-03 20:00:00&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;&lt;P&gt;Any help in dealing with this issue is much appreciated.&lt;/P&gt;&lt;P&gt;Thanks much&lt;/P&gt;</description>
      <pubDate>Thu, 09 Mar 2017 05:14:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147124#M56629</guid>
      <dc:creator>Viraj_Thakur</dc:creator>
      <dc:date>2017-03-09T05:14:24Z</dc:date>
    </item>
    <item>
      <title>Re: Hawq Timestamp datatype in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147125#M56630</link>
      <description>&lt;P&gt;This looks like a precision issue. Postgres 8.2 (which HAWQ is loosely based on) stores timestamps with microsecond precision, whereas the from_unixtime() function expects a value in seconds, which explains why you're a few centuries in the future &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Mar 2017 05:23:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147125#M56630</guid>
      <dc:creator>kdunn926</dc:creator>
      <dc:date>2017-03-09T05:23:45Z</dc:date>
    </item>
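A quick arithmetic sanity check of the precision point above, sketched in plain Scala (no Spark session needed). Dividing the raw value by 1000 treats microseconds as milliseconds, leaving a seconds count of roughly 15,000 years, which is exactly the year-17307 output shown in the question:

```scala
// Raw value of startdateofweek from the question, in microseconds.
val raw = 484012800000000L

// from_unixtime(raw / 1000) interprets its argument as seconds since
// 1970, so the microsecond value is only scaled down by 1000:
val misreadSeconds = raw / 1000L              // 484,012,800,000 "seconds"
val approxYears = misreadSeconds / 31557600L  // 31557600 seconds per Julian year

// Roughly 15,337 years past 1970, matching the year-17307 rows
// in the question's output.
println(approxYears) // 15337
```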
    <item>
      <title>Re: Hawq Timestamp datatype in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147126#M56631</link>
      <description>&lt;P&gt;Thanks for the response, but converting from microseconds to seconds returns a date, just not the right date.&lt;/P&gt;&lt;P&gt;In Spark SQL:&lt;/P&gt;&lt;P&gt;select from_unixtime(startdateofweek/1000000) as ts from TEST where location = 2290 and week = 19 and year = 2015&lt;/P&gt;&lt;P&gt;Returns:&lt;/P&gt;&lt;P&gt;1985-05-03 20:00:00&lt;/P&gt;&lt;P&gt;1985-05-03 20:00:00&lt;/P&gt;&lt;P&gt;In Postgres, running the query:&lt;/P&gt;&lt;P&gt;select startdateofweek as ts from TEST_PARQUET where location = 2290 and week = 19 and year = 2015&lt;/P&gt;&lt;P&gt;Returns:&lt;/P&gt;&lt;P&gt;2015-05-04 00:00:00&lt;/P&gt;</description>
      <pubDate>Thu, 09 Mar 2017 05:32:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147126#M56631</guid>
      <dc:creator>Viraj_Thakur</dc:creator>
      <dc:date>2017-03-09T05:32:27Z</dc:date>
    </item>
    <item>
      <title>Re: Hawq Timestamp datatype in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147127#M56632</link>
      <description>&lt;P&gt;Hawq follows the Postgres 8.4 convention, in which 2000-01-01 acts as the zero value and timestamp data is stored as an 8-byte signed integer with microsecond precision.&lt;/P&gt;&lt;P&gt;From the Postgres 8.4 documentation:&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.postgresql.org/docs/8.4/static/datatype-datetime.html" target="_blank"&gt;https://www.postgresql.org/docs/8.4/static/datatype-datetime.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;“When timestamp values are stored as eight-byte integers (currently the default), microsecond precision is available over the full range of values. ... timestamp values are stored as seconds before or after midnight 2000-01-01.”&lt;/P&gt;&lt;P&gt;So a Hawq timestamp has a 30-year offset against the Unix epoch (1970-01-01).&lt;/P&gt;&lt;P&gt;The expression below converts startdateofweek in the dataframe back to a timestamp:&lt;/P&gt;&lt;P&gt;df.select((($"startdateofweek" + 946684800000000L + 14400000000L)/1000000).cast(TimestampType)).show()&lt;/P&gt;&lt;P&gt;946684800000000L is the 30-year offset in microseconds.&lt;/P&gt;&lt;P&gt;14400000000L is the 4-hour EST-to-GMT offset in microseconds.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Mar 2017 10:25:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147127#M56632</guid>
      <dc:creator>tarun_tiwari</dc:creator>
      <dc:date>2017-03-14T10:25:32Z</dc:date>
    </item>
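The offset arithmetic in the answer above can be reproduced without a Spark session; here is a minimal sketch in plain Scala using java.time. It converts the raw value to UTC directly, so the extra 14400000000L term (which compensates for an EST session timezone in the Spark expression) is not needed. The sample value is the startdateofweek from the question, and the result matches what Postgres reports, 2015-05-04 00:00:00:

```scala
import java.time.Instant

// Microseconds between the Unix epoch (1970-01-01) and the
// Postgres/HAWQ integer-timestamp epoch (2000-01-01): the 30-year offset.
val hawqEpochOffsetUs = 946684800L * 1000000L

// Convert a raw HAWQ timestamp (microseconds since 2000-01-01)
// to a java.time.Instant (UTC).
def hawqMicrosToInstant(raw: Long): Instant = {
  val us = raw + hawqEpochOffsetUs
  Instant.ofEpochSecond(us / 1000000L, (us % 1000000L) * 1000L)
}

// Raw value from the question's first row.
println(hawqMicrosToInstant(484012800000000L)) // 2015-05-04T00:00:00Z
```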
    <item>
      <title>Re: Hawq Timestamp datatype in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147128#M56633</link>
      <description>&lt;P&gt;Great catch &lt;A rel="user" href="https://community.cloudera.com/users/16599/taruntiwari.html" nodeid="16599"&gt;@Tarun Tiwari&lt;/A&gt;!&lt;/P&gt;</description>
      <pubDate>Tue, 14 Mar 2017 23:47:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hawq-Timestamp-datatype-in-Spark/m-p/147128#M56633</guid>
      <dc:creator>kdunn926</dc:creator>
      <dc:date>2017-03-14T23:47:04Z</dc:date>
    </item>
  </channel>
</rss>

