
Hawq Timestamp datatype in Spark

Solved

New Contributor

Hello:

I have a table in Hawq that is stored as Parquet. The problem is that reading the Parquet files in Spark returns the timestamp columns as long. The table DDL is:

CREATE TABLE test (
    location integer,
    year integer,
    week integer,
    startdateofweek timestamp without time zone,
    enddateofweek timestamp without time zone
)
WITH (APPENDONLY=true, ORIENTATION=parquet, COMPRESSTYPE=snappy, OIDS=FALSE)
DISTRIBUTED RANDOMLY;

Note the columns startdateofweek and enddateofweek, which are created as timestamp. Next, I read this table in Spark using the Parquet file location rather than connecting to Hawq over JDBC.

val path = "/hawq_default/16385/312220/479532/*"
val df = spark.read.parquet(path)
df.printSchema
df.show(2)

This returns the timestamp columns as long rather than timestamp.

|-- location: integer (nullable = true)
|-- year: integer (nullable = true)
|-- week: integer (nullable = true)
|-- startdateofweek: long (nullable = true)
|-- enddateofweek: long (nullable = true)

+--------+----+----+---------------+---------------+
|location|year|week|startdateofweek|  enddateofweek|
+--------+----+----+---------------+---------------+
|    2290|2015|  19|484012800000000|484531200000000|
|    2290|2015|  19|484012800000000|484531200000000|
+--------+----+----+---------------+---------------+

I have tried many different things to convert this long to a timestamp, but nothing works. Some of the things I have tried are:

1) Specifying the schema for the Dataset using a case class. The columns still come back as long.

case class myLog(location: Integer, year: Integer, week: Integer, startdateofweek: java.sql.Timestamp, enddateofweek: java.sql.Timestamp)

val path = "/hawq_default/16385/312220/479532/*"
val peopleDS = spark.read.parquet(path).as[myLog]
peopleDS.show(2)

2) Using the from_unixtime function. This returns bogus dates.

select from_unixtime(startdateofweek/1000) as ts from TEST where location = 2290 and week = 19 and year = 2015

17307-10-03 20:00:00
17307-10-03 20:00:00

Any help on how to deal with this issue is much appreciated.

Thanks much


4 REPLIES

Re: Hawq Timestamp datatype in Spark

New Contributor

This looks to be a precision thing. Postgres 8.2 (which HAWQ is loosely based on) stores timestamps with microsecond precision, whereas the from_unixtime() function expects the number in seconds, which explains why you're a few centuries in the future :-)
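As a rough sanity check (plain Scala arithmetic using the sample value copied from the question, not Spark code): dividing the stored microseconds by only 1000 leaves milliseconds, which from_unixtime then reads as seconds, pushing the result far into the future:

// Back-of-the-envelope check of the magnitude of the error (illustrative only).
val storedMicros = 484012800000000L                     // value as stored by Hawq, in microseconds
val asSeconds    = storedMicros / 1000L                 // 484,012,800,000 "seconds" (really milliseconds)
val yearsPast    = asSeconds / (365.25 * 24 * 3600)     // roughly 15,000+ years past the Unix epoch
println(f"$yearsPast%.0f years past 1970")              // consistent with the 17307-10-03 output above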

Re: Hawq Timestamp datatype in Spark

New Contributor

Thanks for the response. Converting from microseconds to seconds returns a date, but not the right one.

In Spark SQL:

select from_unixtime(startdateofweek/1000000) as ts from TEST where location = 2290 and week = 19 and year = 2015

Returns:

1985-05-03 20:00:00

1985-05-03 20:00:00

In Postgres, running the query:

select startdateofweek as ts from TEST_PARQUET where location = 2290 and week = 19 and year = 2015

Returns:

2015-05-04 00:00:00

Re: Hawq Timestamp datatype in Spark (Accepted Solution)

New Contributor

Hawq follows the Postgres 8.4 convention, in which 2000-01-01 acts as the zero value and timestamp data is stored as an 8-byte signed integer with microsecond precision.

From the Postgres 8.4 documentation:

https://www.postgresql.org/docs/8.4/static/datatype-datetime.html

“When timestamp values are stored as eight-byte integers (currently the default), microsecond precision is available over the full range of values. ... timestamp values are stored as seconds before or after midnight 2000-01-01.”

So a Hawq timestamp has a 30-year offset relative to the Unix epoch (1970-01-01).

The expression below converts startdateofweek in the DataFrame back to a timestamp.

import org.apache.spark.sql.types.TimestampType
df.select((($"startdateofweek" + 946684800000000L + 14400000000L) / 1000000).cast(TimestampType)).show()

946684800000000L is the 30-year offset in microseconds.

14400000000L is the 4-hour EST-to-GMT offset in microseconds.

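A more self-contained sketch of the same conversion, under a few assumptions on my part: Spark 2.2+ (for the spark.sql.session.timeZone setting), the path and column names taken from the question, and values that are microseconds since 2000-01-01 00:00:00 UTC. Setting the session time zone to UTC avoids hard-coding the 4-hour EST adjustment:

// Illustrative sketch; assumes the Parquet values are microseconds since 2000-01-01 UTC.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

val spark = SparkSession.builder().appName("hawq-timestamps").getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")  // render timestamps in UTC instead of the local zone

// 30-year offset between the Unix epoch (1970-01-01) and the Hawq/Postgres epoch (2000-01-01), in microseconds
val epochOffsetMicros = 946684800000000L

val df = spark.read.parquet("/hawq_default/16385/312220/479532/*")
df.select(
    ((col("startdateofweek") + epochOffsetMicros) / 1000000L).cast(TimestampType).as("startdateofweek"),
    ((col("enddateofweek") + epochOffsetMicros) / 1000000L).cast(TimestampType).as("enddateofweek")
  )
  .show(2, truncate = false)
// With the sample values from the question this should print 2015-05-04 00:00:00 and 2015-05-10 00:00:00.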

Re: Hawq Timestamp datatype in Spark

New Contributor

Great catch @Tarun Tiwari!
