Created 06-12-2017 07:10 AM
All,
I have a table which has 3 columns and is stored in ORC format; the data is as below:
+--------------+-------------+--------------------------+--+
| vehicle_hdr  | vehicle_no  | incident_timestamp       |
+--------------+-------------+--------------------------+--+
| XXXX         | 3911        | 1969-06-19 06:57:26.485  |
| XXXX         | 3911        | 1988-06-21 05:36:22.35   |
+--------------+-------------+--------------------------+--+
The DDL for the table is as below:
create table test (vehicle_hdr string, vehicle_no string, incident_timestamp timestamp) stored as ORC;
From the Hive beeline I am able to view the results, but when I am using PySpark 2.1 and running the below code
o1 = sqlContext.sql("select vehicle_hdr, incident_timestamp from test")
I am getting the below error:
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
	at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:39)
	at org.apache.spark.sql.hive.HadoopTableReader$anonfun$14$anonfun$apply$11.apply(TableReader.scala:393)
	at org.apache.spark.sql.hive.HadoopTableReader$anonfun$14$anonfun$apply$11.apply(TableReader.scala:392)
	at org.apache.spark.sql.hive.HadoopTableReader$anonfun$fillObject$2.apply(TableReader.scala:416)
	at org.apache.spark.sql.hive.HadoopTableReader$anonfun$fillObject$2.apply(TableReader.scala:408)
	at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
Created 06-13-2017 05:15 AM
I have just tested the same in PySpark 2.1. That works fine on my site. See below:
beeline
0: jdbc:hive2://dkhdp262.openstacklocal:2181,> create table test_orc (b string,t timestamp) stored as ORC;
0: jdbc:hive2://dkhdp262.openstacklocal:2181,> select * from test_orc;
+-------------+------------------------+--+
| test_orc.b  | test_orc.t             |
+-------------+------------------------+--+
| a           | 2017-06-13 05:02:23.0  |
| b           | 2017-06-13 05:02:23.0  |
| c           | 2017-06-13 05:02:23.0  |
| d           | 2017-06-13 05:02:23.0  |
| e           | 2017-06-13 05:02:23.0  |
| f           | 2017-06-13 05:02:23.0  |
| g           | 2017-06-13 05:02:23.0  |
| h           | 2017-06-13 05:02:23.0  |
| i           | 2017-06-13 05:02:23.0  |
| j           | 2017-06-13 05:02:23.0  |
+-------------+------------------------+--+
10 rows selected (0.091 seconds)
pyspark
[root@dkhdp262 ~]# export SPARK_MAJOR_VERSION=2
[root@dkhdp262 ~]# pyspark
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.5 (default, Jun 17 2014, 18:11:42)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.1.0-129
      /_/

Using Python version 2.7.5 (default, Jun 17 2014 18:11:42)
SparkSession available as 'spark'.
>>> sqlContext.sql("select b, t from test_orc").show()
+---+--------------------+
|  b|                   t|
+---+--------------------+
|  a|2017-06-13 05:02:...|
|  b|2017-06-13 05:02:...|
|  c|2017-06-13 05:02:...|
|  d|2017-06-13 05:02:...|
|  e|2017-06-13 05:02:...|
|  f|2017-06-13 05:02:...|
|  g|2017-06-13 05:02:...|
|  h|2017-06-13 05:02:...|
|  i|2017-06-13 05:02:...|
|  j|2017-06-13 05:02:...|
+---+--------------------+
Based on the error you have - is the timestamp value in your table a REAL timestamp? How did you insert it?
Created 06-13-2017 05:23 AM
Thanks @Daniel, the timestamps in my case are real timestamps coming from our sensors. As can be seen, the values 1969-06-19 06:57:26.485 and 1988-06-21 05:36:22.35 are in my table.
I inserted the data from a PySpark program; code snippet below:
write_df = final_df.where(col(first_partitioned_column).isin(format(first_partition)))
write_df.drop(first_partitioned_column)
write_df.write.mode("overwrite").format("orc").partitionBy(first_partitioned_column).save(path)
One thing I observed was that the incident_timestamp column in write_df was of string datatype and not timestamp, but my assumption was that Spark would cast it internally when a dataframe column is a string and the table column is a timestamp.
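(As a sanity check, here is a minimal sketch of how that mismatch could be confirmed. It assumes the Hive-enabled session already available in the shell; write_df stands for the dataframe from the snippet above and is only referenced in comments:)

from pyspark.sql import SparkSession

# On HDP 2.6 the shell already provides `spark`/`sqlContext`; this is only
# needed when running as a standalone script.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. What Spark is about to write: the dataframe's own schema.
#    write_df.printSchema()
#    -> incident_timestamp: string

# 2. What the Hive metastore promises readers for the target table.
spark.sql("describe test").show(truncate=False)
#    -> incident_timestamp  timestamp

# If the two disagree, the ORC files physically contain strings while the table
# schema declares timestamps, which matches the Text -> TimestampWritable
# ClassCastException seen when reading the table back.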
Another thing to note is that from beeline I am able to query the results without any issues.
Thanks in advance.
Created 06-13-2017 05:32 AM
I have just done another test, treating the timestamp value as a string. That works for me as well. See below:
beeline
> create table test_orc_t_string (b string,t timestamp) stored as ORC;
> insert into table test_orc_t_string values('a', '1969-06-19 06:57:26.485'),('b','1988-06-21 05:36:22.35');
> select * from test_orc_t_string;
+----------------------+--------------------------+--+
| test_orc_t_string.b  | test_orc_t_string.t      |
+----------------------+--------------------------+--+
| a                    | 1969-06-19 06:57:26.485  |
| b                    | 1988-06-21 05:36:22.35   |
+----------------------+--------------------------+--+
2 rows selected (0.128 seconds)
pyspark
>>> sqlContext.sql("select * from test_orc_t_string").show()
+---+--------------------+
|  b|                   t|
+---+--------------------+
|  a|1969-06-19 06:57:...|
|  b|1988-06-21 05:36:...|
+---+--------------------+
Can you test the above at your site? Let me know how this works.
Can you also send me the output of the below from beeline:
show create table test;
Created 06-13-2017 05:43 AM
I have tested the above case and it works fine on my end as well. I also created a table with the timestamp column as string, inserted the data from this temp table into the main table with the timestamp datatype, and from Spark I am able to read that data without any issues (see the sketch below).
I guess the issue arises when I am inserting data from Spark into Hive and then reading it back.
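(For reference, a minimal sketch of that staging-table route, issued through spark.sql with a Hive-enabled session. The staging table name test_stage is hypothetical; adjust the column list to the real schema:)

# Staging table that stores the raw values as strings.
spark.sql("""
    create table if not exists test_stage (
        vehicle_hdr string,
        vehicle_no string,
        incident_timestamp string
    ) stored as ORC
""")

# After loading the raw data into test_stage, let Hive do the string -> timestamp
# conversion while copying into the real table.
spark.sql("""
    insert into table test
    select vehicle_hdr,
           vehicle_no,
           cast(incident_timestamp as timestamp)
    from test_stage
""")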
Created 06-13-2017 05:52 AM
Also, see below the structure of the dataframe before the write method is called:
DataFrame[vehicle_hdr: string, vehicle_no: string, incident_timestamp: string]
Created 06-13-2017 05:59 AM
It is good to hear the sample works.
I have a feeling the problem may be with the way you created your original table.
Hence, try another thing: point your code to the test_orc_t_string table, the one from my sample above, and check if that works.
Created 06-13-2017 07:59 AM
I resolved the issue. It was because the input dataframe column had a string datatype; I used a cast function inside my Spark code and now everything is working fine. Thanks for your help.
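(A minimal sketch of that kind of fix, assuming the variable and column names from the earlier snippet, write_df / incident_timestamp / first_partitioned_column / path; the exact cast belongs wherever the dataframe is built:)

from pyspark.sql.functions import col

# Cast the string column to a real timestamp before writing, so the ORC files
# match the timestamp type declared in the Hive table definition.
write_df = write_df.withColumn(
    "incident_timestamp",
    col("incident_timestamp").cast("timestamp")
)

write_df.write.mode("overwrite") \
    .format("orc") \
    .partitionBy(first_partitioned_column) \
    .save(path)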
Created 06-13-2017 08:11 AM
That is great - thanks for letting me know.