Member since: 09-13-2016
Posts: 31
Kudos Received: 5
Solutions: 2
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2709 | 06-12-2017 07:05 AM |
|  | 1628 | 02-27-2017 06:42 PM |
06-13-2017
07:59 AM
Hi @Daniel Kozlowski, I resolved the issue. It was because the input table I had defined had a string datatype; I used a cast function inside my Spark code and now everything is working fine. Thanks for your help.
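In case it helps anyone, a minimal sketch of the kind of cast I mean (the column and variable names just mirror my earlier snippets and are illustrative, not the exact ones from my job):

from pyspark.sql.functions import col

# incident_timestamp was held as a string in the dataframe; cast it to a real
# timestamp before writing so the ORC data matches the Hive column type
write_df = write_df.withColumn(
    "incident_timestamp",
    col("incident_timestamp").cast("timestamp"))
write_df.write.mode("overwrite").format("orc") \
    .partitionBy(first_partitioned_column).save(path)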
06-13-2017
05:52 AM
Also, see below the structure of the dataframe before the write method is called: DataFrame[vehicle_hdr: string, vehicle_no: string, incident_timestamp: string]
06-13-2017
05:43 AM
Hi @Daniel Kozlowski, I have tested the above case and it works fine on my end as well. Also, I created a table with the timestamp column as string, and then from this temp table I inserted the data into the main table with the timestamp datatype; from Spark I am able to read that data without any issues. I guess the issue arises when I insert data from Spark into Hive and read it back.
06-13-2017
05:23 AM
Thanks @Daniel, the timestamps in my case are real timestamps that are coming from our sensors. As can be seen, the timestamp values 1969-06-19 06:57:26.485 and 1988-06-21 05:36:22.35 are in my table. I inserted the data from a PySpark program; code snippet below:

write_df = final_df.where(col(first_partitioned_column).isin(format(first_partition)))
write_df.drop(first_partitioned_column)
write_df.write.mode("overwrite").format("orc").partitionBy(first_partitioned_column).save(path)

One thing I observed was that the timestamp column in write_df was of string datatype and not timestamp, but my assumption is that Spark will do the cast internally where a dataframe column is string and the table column is of timestamp type. Another thing to note is that from beeline I am able to query the results without any issues. Thanks in advance.
06-12-2017
07:10 AM
All, I have a table which has 3 columns and is in ORC format; the data is as below:

+--------------+-------------+--------------------------+--+
| vehicle_hdr  | vehicle_no  | incident_timestamp       |
+--------------+-------------+--------------------------+--+
| XXXX         | 3911        | 1969-06-19 06:57:26.485  |
| XXXX         | 3911        | 1988-06-21 05:36:22.35   |
+--------------+-------------+--------------------------+--+

The DDL for the table is as below:

create table test (vehicle_hdr string, vehicle_no string, incident_timestamp timestamp) stored as ORC;

From the Hive beeline I am able to view the results, but when I am using PySpark 2.1 and running the below code

o1 = sqlContext.sql("select vehicle_hdr, incident_timestamp from test")

I am getting the below error:

Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:39)
at org.apache.spark.sql.hive.HadoopTableReader$anonfun$14$anonfun$apply$11.apply(TableReader.scala:393)
at org.apache.spark.sql.hive.HadoopTableReader$anonfun$14$anonfun$apply$11.apply(TableReader.scala:392)
at org.apache.spark.sql.hive.HadoopTableReader$anonfun$fillObject$2.apply(TableReader.scala:416)
at org.apache.spark.sql.hive.HadoopTableReader$anonfun$fillObject$2.apply(TableReader.scala:408)
at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
Labels:
- Apache Hive
06-12-2017
07:05 AM
I resolved it by removing the column on which the table was partitioned from the dataframe.
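For anyone hitting the same IndexOutOfBoundsException: the partition value is already encoded in the directory name, so writing the partition column into the ORC files gives the files one more column than the table's non-partition schema expects. A minimal sketch of the change (the names mirror the loop in my original question further down and are illustrative):

# drop the partition column before saving to the partition directory;
# drop() returns a new DataFrame, so the result has to be reassigned
output_df = output_df.drop(partitioned_column)
output_df.write.option("compression", "none").mode("overwrite").format("orc").save(path)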
06-01-2017
04:35 PM
I am using PySpark 2.1 to create partitions dynamically from table A to table B. Below are the DDLs:

create table A (
  objid bigint,
  occur_date timestamp)
STORED AS ORC;

create table B (
  objid bigint,
  occur_date timestamp)
PARTITIONED BY (
  occur_date_pt date)
STORED AS ORC;

I am then using PySpark code where I am trying to determine the partitions that need to be merged; below is the portion of code where I am actually doing that:

for row in incremental_df.select(partitioned_column).distinct().collect():
    path = '/apps/hive/warehouse/B/' + partitioned_column + '=' + format(row[0])
    old_df = merge_df.where(col(partitioned_column).isin(format(row[0])))
    new_df = incremental_df.where(col(partitioned_column).isin(format(row[0])))
    output_df = old_df.subtract(new_df)
    output_df = output_df.unionAll(new_df)
    output_df.write.option("compression","none").mode("overwrite").format("orc").save(path)

refresh_metadata_sql = 'MSCK REPAIR TABLE ' + table_name
sqlContext.sql(refresh_metadata_sql)

On execution of the code I am able to see the partitions in HDFS:

Found 3 items
drwx------   - 307010265 hdfs   0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-01
drwx------   - 307010265 hdfs   0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-02
drwx------   - 307010265 hdfs   0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-03

But when I am trying to access the table inside Spark I am getting an array out of bounds error:

>>> merge_df = sqlContext.sql('select * from B')
DataFrame[]
>>> merge_df.show()
17/06/01 10:33:13 ERROR Executor: Exception in task 0.0 in stage 200.0 (TID 4827)
java.lang.IndexOutOfBoundsException: toIndex = 3
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at org.apache.spark.rdd.HadoopRDD$anon$1.liftedTree1$1(HadoopRDD.scala:252)
at org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:251)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
Labels:
- Apache Hive
- Apache Spark
03-01-2017
08:07 AM
Hi All, I have installed LLAP in my HDP 2.5 cluster. I believe the Hive View in Ambari by default uses Hive on Tez; how can I make it use Hive LLAP? Thanks, Jayadeep
Labels:
- Apache Hive
02-27-2017
06:42 PM
I resolved it by adding a few extra lines in the rewrite rule to handle the websocket interaction:

RewriteRule ^/ws(.*)$ ws://localhost:9995/ws [P]