Support Questions
Find answers, ask questions, and share your expertise

CDH Spark 1.6 parquet timezone problem

Explorer

Hi, I am facing a timezone (or time) conversion problem when I read data in Spark 1.6. Here is an example:

> impala-shell
create table test (dt timestamp) stored as parquet;
insert into test select cast('2015-01-01 00:00:00' as timestamp);
insert into test select cast('1900-01-01 00:00:00' as timestamp);
select * from test;
+---------------------+
| dt |
+---------------------+
| 2015-01-01 00:00:00 |
| 1900-01-01 00:00:00 |
+---------------------+

> hive
select * from test;
1900-01-01 00:00:00
2015-01-01 00:00:00
> spark-shell
scala> sqlContext.sql("select * from test").collect();
res1: Array[org.apache.spark.sql.Row] = Array([1900-01-01 02:30:17.0], [2015-01-01 03:00:00.0])

I also found a similar issue in Apache Spark, SPARK-10177 (fixed in v1.6), but I cannot find this issue in the CDH 5.5.x Release Notes (Spark 1.5 was first introduced in CDH 5.5). Is it possible that it is still unsolved in CDH? Or does anyone have ideas for a workaround? The goal is to read timestamps from Parquet independently of the timezone.

I am currently on CDH 5.8.
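
One workaround I am trying, for reference: force the JVM default timezone to UTC so that the java.sql.Timestamp values Spark returns are rendered without a local-zone shift. This is only a sketch, assuming the shift comes from the JVM's default zone; the offsets in my output (+3:00, and the historical +2:30:17 for 1900) suggest a zone like Europe/Moscow, so treating that as the cause is an assumption.

> spark-shell \
    --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC \
    --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC

scala> // Sanity-check the driver-side default zone; executors pick up
scala> // the -Duser.timezone flag from the conf above.
scala> import java.util.TimeZone
scala> TimeZone.getDefault.getID
scala> sqlContext.sql("select * from test").collect()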

1 REPLY

Re: CDH Spark 1.6 parquet timezone problem

Expert Contributor

Cloudera doesn't remove features; they only backport some fixes. So if the JIRA says it is fixed in 1.5, the fix will still be included in Cloudera's 1.5, possibly with other fixes as well. You can also check whether it is included in Cloudera's distribution on GitHub: https://github.com/cloudera/spark/blob/cdh5-1.5.0_5.5.0/sql/catalyst/src/main/scala/org/apache/spark...

It does look like the JIRA above was fixed in CDH 5.5. Note, though, that Impala handles timestamps differently than Hive and Spark.
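
As an illustration of that difference (a sketch against the "test" table above, not a definitive fix): Impala writes the Parquet timestamp without any timezone adjustment, while Spark renders the stored instant in the JVM's local zone, which is where the shift in your output comes from. Shifting the value back with to_utc_timestamp is one way to recover the literal wall clock without restarting the JVM; "Europe/Moscow" is an assumption based on your offsets, so substitute the zone your driver actually runs in.

scala> import org.apache.spark.sql.functions.to_utc_timestamp
scala> val df = sqlContext.table("test")
scala> // Undo the local-zone display shift; to_utc_timestamp applies the
scala> // zone's offset at each instant, so the 1900 row's historical
scala> // +2:30:17 offset is compensated as well.
scala> df.select(to_utc_timestamp(df("dt"), "Europe/Moscow").as("dt")).show()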