Timestamp (ms) error, when reading ORC file with the native implementation


When reading an ORC file with the native implementation, the milliseconds of timestamps are doubled.


The issue seems to be reported as a known issue for HDP 2.6.5, which is the environment I'm currently using.

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_release-notes/content/known_issues.html - BUG-103805


However, when reproducing the test case described in SPARK-24322, I'm seeing a slightly different issue.


The microseconds are read correctly, but the milliseconds are doubled: .021798 is written, yet .042798 is read back, so the millisecond part 021 becomes 042 while the microsecond remainder 798 is untouched.
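To make that explicit, here is a small illustrative check of the nanosecond field of both values (not part of the repro; the second timestamp is simply the value the native reader returns in the results further down):

// Illustrative only: decompose the nanosecond field of the written value and of
// the value the native reader hands back.
val written  = java.sql.Timestamp.valueOf("1900-05-05 12:34:56.021798")
val readBack = java.sql.Timestamp.valueOf("1900-05-05 12:34:56.042798")
println(written.getNanos)   // 21798000 -> 021 ms + 798 us
println(readBack.getNanos)  // 42798000 -> 042 ms + 798 us: the millisecond part is doubled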


Below are my test results from Apache Zeppelin:

import spark.sql
import spark.implicits._

"""
Ref: https://issues.apache.org/jira/browse/SPARK-24322
"""
spark.version

spark.sql("set spark.sql.orc.impl=native")

Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.021798")).toDF().write.orc("/tmp/orc_test-native")
spark.read.orc("/tmp/orc_test-native").show(false)

which results in:

import spark.sql
import spark.implicits._
res196: String =
"
Ref: https://issues.apache.org/jira/browse/SPARK-24322
"
res199: String = 2.3.0.2.6.5.0-292
res201: org.apache.spark.sql.DataFrame = [key: string, value: string]
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:56.042798|
+--------------------------+
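For comparison, I have not yet tried switching implementations, but re-reading the same file with the hive ORC reader, and writing a second file with the hive implementation to read back with the native one (to a new path, /tmp/orc_test-hive, just for this check), should narrow down whether the problem is on the read or the write side:

// Untested comparison sketch: swap the ORC implementation to isolate the problem.
// 1) Re-read the natively written file with the hive reader.
spark.sql("set spark.sql.orc.impl=hive")
spark.read.orc("/tmp/orc_test-native").show(false)

// 2) Write with the hive implementation, then read back with the native reader.
Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.021798")).toDF().write.orc("/tmp/orc_test-hive")
spark.sql("set spark.sql.orc.impl=native")
spark.read.orc("/tmp/orc_test-hive").show(false)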


Any chance I'm facing the same issue?