<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: ORC Table Timestamp PySpark 2.1 CASTIssue in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205961#M62712</link>
    <description>&lt;P&gt;Also, see below the structure of the DataFrame before the write method is called:&lt;/P&gt;&lt;PRE&gt;DataFrame[vehicle_hdr: string, vehicle_no: string, incident_timestamp: string]&lt;/PRE&gt;</description>
    <pubDate>Tue, 13 Jun 2017 12:52:45 GMT</pubDate>
    <dc:creator>jayadeep_jayara</dc:creator>
    <dc:date>2017-06-13T12:52:45Z</dc:date>
    <item>
      <title>ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205956#M62707</link>
      <description>&lt;P&gt;All,&lt;/P&gt;&lt;P&gt;I have a table with 3 columns, stored in ORC format. The data is as below:&lt;/P&gt;&lt;PRE&gt;+--------------+-------------+--------------------------+--+
| vehicle_hdr  | vehicle_no  |    incident_timestamp    |
+--------------+-------------+--------------------------+--+
| XXXX         | 3911        | 1969-06-19 06:57:26.485  |
| XXXX         | 3911        | 1988-06-21 05:36:22.35   |&lt;/PRE&gt;&lt;P&gt;The DDL for the table is as below:&lt;/P&gt;&lt;PRE&gt;create table test (vehicle_hdr string, vehicle_no string, incident_timestamp timestamp) stored as ORC;&lt;/PRE&gt;&lt;P&gt;From Hive beeline I am able to view the results, but when I run the code below in PySpark 2.1&lt;/P&gt;&lt;PRE&gt;o1 = sqlContext.sql("select vehicle_hdr, incident_timestamp  from test")&lt;/PRE&gt;&lt;P&gt;I get the error below:&lt;/P&gt;&lt;PRE&gt;Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:39)
        at org.apache.spark.sql.hive.HadoopTableReader$anonfun$14$anonfun$apply$11.apply(TableReader.scala:393)
        at org.apache.spark.sql.hive.HadoopTableReader$anonfun$14$anonfun$apply$11.apply(TableReader.scala:392)
        at org.apache.spark.sql.hive.HadoopTableReader$anonfun$fillObject$2.apply(TableReader.scala:416)
        at org.apache.spark.sql.hive.HadoopTableReader$anonfun$fillObject$2.apply(TableReader.scala:408)
        at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$anon$11.next(Iterator.scala:328)&lt;/PRE&gt;</description>
      <pubDate>Mon, 12 Jun 2017 14:10:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205956#M62707</guid>
      <dc:creator>jayadeep_jayara</dc:creator>
      <dc:date>2017-06-12T14:10:15Z</dc:date>
    </item>
    <item>
      <title>Re: ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205957#M62708</link>
      <description>&lt;P&gt;Hi @&lt;A href="https://community.hortonworks.com/users/13072/jayadeepjayaraman.html"&gt;Jayadeep Jayaraman&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I have just tested the same in PySpark 2.1. That works fine on my side. See below:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;beeline&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;0: jdbc:hive2://dkhdp262.openstacklocal:2181,&amp;gt; create table test_orc (b string,t timestamp) stored as ORC;
0: jdbc:hive2://dkhdp262.openstacklocal:2181,&amp;gt; select * from test_orc;
+-------------+------------------------+--+
| test_orc.b  |       test_orc.t       |
+-------------+------------------------+--+
| a           | 2017-06-13 05:02:23.0  |
| b           | 2017-06-13 05:02:23.0  |
| c           | 2017-06-13 05:02:23.0  |
| d           | 2017-06-13 05:02:23.0  |
| e           | 2017-06-13 05:02:23.0  |
| f           | 2017-06-13 05:02:23.0  |
| g           | 2017-06-13 05:02:23.0  |
| h           | 2017-06-13 05:02:23.0  |
| i           | 2017-06-13 05:02:23.0  |
| j           | 2017-06-13 05:02:23.0  |
+-------------+------------------------+--+
10 rows selected (0.091 seconds)
&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;pyspark&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;[root@dkhdp262 ~]# export SPARK_MAJOR_VERSION=2
[root@dkhdp262 ~]# pyspark
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.5 (default, Jun 17 2014, 18:11:42)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.1.0-129
      /_/


Using Python version 2.7.5 (default, Jun 17 2014 18:11:42)
SparkSession available as 'spark'.
&amp;gt;&amp;gt;&amp;gt; sqlContext.sql("select b, t from test_orc").show()
+---+--------------------+
|  b|                   t|
+---+--------------------+
|  a|2017-06-13 05:02:...|
|  b|2017-06-13 05:02:...|
|  c|2017-06-13 05:02:...|
|  d|2017-06-13 05:02:...|
|  e|2017-06-13 05:02:...|
|  f|2017-06-13 05:02:...|
|  g|2017-06-13 05:02:...|
|  h|2017-06-13 05:02:...|
|  i|2017-06-13 05:02:...|
|  j|2017-06-13 05:02:...|
+---+--------------------+
&lt;/PRE&gt;&lt;P&gt;Based on the error you have - is the timestamp value in your table a REAL timestamp? How did you insert it?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2017 12:15:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205957#M62708</guid>
      <dc:creator>dkozlowski</dc:creator>
      <dc:date>2017-06-13T12:15:02Z</dc:date>
    </item>
    <item>
      <title>Re: ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205958#M62709</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/3675/dkozlowski.html" nodeid="3675"&gt;@Daniel&lt;/A&gt;. The timestamps in my case are real timestamps coming from our sensors. As can be seen, the timestamp values 1969-06-19 06:57:26.485 and 1988-06-21 05:36:22.35 are in my table.&lt;/P&gt;&lt;P&gt;I inserted the data from a PySpark program; the code snippet is below:&lt;/P&gt;&lt;PRE&gt;write_df = final_df.where(col(first_partitioned_column).isin(format(first_partition)))
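# Note: drop() returns a new DataFrame; the result is discarded here, so write_df keeps the column that partitionBy uses below
write_df.drop(first_partitioned_column)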
write_df.write.mode("overwrite").format("orc").partitionBy(first_partitioned_column).save(path)&lt;/PRE&gt;&lt;P&gt;One thing I observed is that the timestamp column in write_df is of string datatype, not timestamp. My assumption was that Spark would do the cast internally when a DataFrame column is a string and the table column is a timestamp.&lt;/P&gt;&lt;P&gt;Another thing to note is that from beeline I am able to query the results without any issues.&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2017 12:23:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205958#M62709</guid>
      <dc:creator>jayadeep_jayara</dc:creator>
      <dc:date>2017-06-13T12:23:29Z</dc:date>
    </item>
    <item>
      <title>Re: ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205959#M62710</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/13072/jayadeepjayaraman.html" nodeid="13072"&gt;@Jayadeep Jayaraman&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I have just done another test, inserting the timestamps as strings. That works for me as well. See below:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;beeline&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;&amp;gt; create table test_orc_t_string (b string,t timestamp) stored as ORC;
&amp;gt; insert into table test_orc_t_string values('a', '1969-06-19 06:57:26.485'),('b','1988-06-21 05:36:22.35');
&amp;gt; select * from test_orc_t_string;
+----------------------+--------------------------+--+
| test_orc_t_string.b  |   test_orc_t_string.t    |
+----------------------+--------------------------+--+
| a                    | 1969-06-19 06:57:26.485  |
| b                    | 1988-06-21 05:36:22.35   |
+----------------------+--------------------------+--+
2 rows selected (0.128 seconds)
&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;pyspark&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;&amp;gt;&amp;gt;&amp;gt; sqlContext.sql("select * from test_orc_t_string").show()
+---+--------------------+
|  b|                   t|
+---+--------------------+
|  a|1969-06-19 06:57:...|
|  b|1988-06-21 05:36:...|
+---+--------------------+
&lt;/PRE&gt;&lt;P&gt;Can you test the above on your side? Let me know how it works.&lt;/P&gt;&lt;P&gt;Can you also send me the output of the command below from beeline:&lt;/P&gt;&lt;PRE&gt;show create table test;&lt;/PRE&gt;</description>
      <pubDate>Tue, 13 Jun 2017 12:32:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205959#M62710</guid>
      <dc:creator>dkozlowski</dc:creator>
      <dc:date>2017-06-13T12:32:02Z</dc:date>
    </item>
    <item>
      <title>Re: ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205960#M62711</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/3675/dkozlowski.html" nodeid="3675"&gt;@Daniel Kozlowski&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;I have tested the above case and it works fine on my end as well. Also, I created a table with the timestamp column as string, and from this temp table I inserted the data into the main table with the timestamp datatype. From Spark I am able to read that data without any issues.&lt;/P&gt;&lt;P&gt;I guess the issue occurs when I insert data from Spark into Hive and read it back.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2017 12:43:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205960#M62711</guid>
      <dc:creator>jayadeep_jayara</dc:creator>
      <dc:date>2017-06-13T12:43:12Z</dc:date>
    </item>
    <item>
      <title>Re: ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205961#M62712</link>
      <description>&lt;P&gt;Also, see below the structure of the DataFrame before the write method is called:&lt;/P&gt;&lt;PRE&gt;DataFrame[vehicle_hdr: string, vehicle_no: string, incident_timestamp: string]&lt;/PRE&gt;</description>
      <pubDate>Tue, 13 Jun 2017 12:52:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205961#M62712</guid>
      <dc:creator>jayadeep_jayara</dc:creator>
      <dc:date>2017-06-13T12:52:45Z</dc:date>
    </item>
    <item>
      <title>Re: ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205962#M62713</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/13072/jayadeepjayaraman.html" nodeid="13072"&gt;@Jayadeep Jayaraman&lt;/A&gt; &lt;/P&gt;&lt;P&gt;It is good to hear the sample works.&lt;/P&gt;&lt;P&gt;I have a feeling that the problem may be with the way you created your original table.&lt;/P&gt;&lt;P&gt;Hence, try another thing: point your code to the test_orc_t_string table, the one from my sample above. Check if that works.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2017 12:59:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205962#M62713</guid>
      <dc:creator>dkozlowski</dc:creator>
      <dc:date>2017-06-13T12:59:46Z</dc:date>
    </item>
    <item>
      <title>Re: ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205963#M62714</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/3675/dkozlowski.html" nodeid="3675"&gt;@Daniel Kozlowski&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;I resolved the issue. It was because the input table I had defined had the string datatype. I used a cast function inside my Spark code, and now everything is working fine. Thanks for your help.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2017 14:59:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205963#M62714</guid>
      <dc:creator>jayadeep_jayara</dc:creator>
      <dc:date>2017-06-13T14:59:38Z</dc:date>
    </item>
    <item>
      <title>Re: ORC Table Timestamp PySpark 2.1 CASTIssue</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205964#M62715</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/13072/jayadeepjayaraman.html" nodeid="13072"&gt;@Jayadeep Jayaraman&lt;/A&gt;&lt;/P&gt;&lt;P&gt;That is great - thanks for letting me know&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2017 15:11:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/ORC-Table-Timestamp-PySpark-2-1-CASTIssue/m-p/205964#M62715</guid>
      <dc:creator>dkozlowski</dc:creator>
      <dc:date>2017-06-13T15:11:41Z</dc:date>
    </item>
  </channel>
</rss>

