Created 06-01-2017 04:35 PM
I am using pyspark 2.1 to create partitions dynamically from table A to table B. Below are the DDLs:
create table A (
  objid bigint,
  occur_date timestamp
) STORED AS ORC;

create table B (
  objid bigint,
  occur_date timestamp
) PARTITIONED BY (occur_date_pt date) STORED AS ORC;
I am then using PySpark code to determine which partitions need to be merged; below is the portion of code that actually does that:
from pyspark.sql.functions import col

for row in incremental_df.select(partitioned_column).distinct().collect():
    # Target directory for this partition value
    path = '/apps/hive/warehouse/B/' + partitioned_column + '=' + format(row[0])
    # Existing rows and incoming rows for this partition
    old_df = merge_df.where(col(partitioned_column).isin(format(row[0])))
    new_df = incremental_df.where(col(partitioned_column).isin(format(row[0])))
    # Keep old rows that are not being replaced, then append the new rows
    output_df = old_df.subtract(new_df)
    output_df = output_df.unionAll(new_df)
    output_df.write.option("compression", "none").mode("overwrite").format("orc").save(path)
    # Let Hive pick up the partition directories written directly to HDFS
    refresh_metadata_sql = 'MSCK REPAIR TABLE ' + table_name
    sqlContext.sql(refresh_metadata_sql)
On execution of the code I can see the partitions in HDFS:
Found 3 items
drwx------   - 307010265 hdfs          0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-01
drwx------   - 307010265 hdfs          0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-02
drwx------   - 307010265 hdfs          0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-03
But when I try to access the table from Spark, I get an IndexOutOfBoundsException:
>>> merge_df = sqlContext.sql('select * from B')
DataFrame[]
>>> merge_df.show()
17/06/01 10:33:13 ERROR Executor: Exception in task 0.0 in stage 200.0 (TID 4827)
java.lang.IndexOutOfBoundsException: toIndex = 3
    at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
    at java.util.ArrayList.subList(ArrayList.java:996)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
    at org.apache.spark.rdd.HadoopRDD$anon$1.liftedTree1$1(HadoopRDD.scala:252)
    at org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:251)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
Created 06-12-2017 07:05 AM
I resolved it by dropping the column on which the table is partitioned from the dataframe before writing.
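For reference, the write inside the loop then looks roughly like this (a sketch; output_df, partitioned_column and path are the same variables as in the loop above):

# Sketch of the fix: drop the partition column before writing, since its value
# is already carried by the directory name (occur_date_pt=YYYY-MM-DD)
output_df.drop(partitioned_column) \
    .write.option("compression", "none") \
    .mode("overwrite") \
    .format("orc") \
    .save(path)

Since the partition value is already encoded in the directory name, keeping that column inside the ORC files gives the files one more column than the table schema expects, which appears to be what triggers the IndexOutOfBoundsException on read.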