Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Spark 2.1 Hive Partition Adding Issue ORC Format

avatar
Rising Star

I am using pyspark 2.1 to create partitions dynamically from table A to table B. Below are the DDL's

<code>create table A (
objid bigint,
occur_date timestamp)
STORED AS ORC;

create table B (
objid bigint,
occur_date timestamp)
PARTITIONED BY (
occur_date_pt date)
STORED AS ORC;

I am then using a pyspark code where I am trying to determine the partitions that need to be merged, below is the portion of code where I am actually doing that

<code>for row in  incremental_df.select(partitioned_column).distinct().collect():
    path            = '/apps/hive/warehouse/B/' + partitioned_column + '=' + format(row[0])
    old_df          = merge_df.where(col(partitioned_column).isin(format(row[0])))
    new_df          = incremental_df.where(col(partitioned_column).isin(format(row[0])))
    output_df       = old_df.subtract(new_df)
    output_df       = output_df.unionAll(new_df)
    output_df.write.option("compression","none").mode("overwrite").format("orc").save(path)
    refresh_metadata_sql = 'MSCK REPAIR TABLE ' + table_name
    sqlContext.sql(refresh_metadata_sql)

On Execution of the code I am able to see the partitions in HDFS

Found 3 items drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-01 drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-02 drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-03

But when I am trying to access the table inside Spark I am getting array out of bound error

<code>>> merge_df = sqlContext.sql('select * from B')
DataFrame[]
>>> merge_df.show()
17/06/01 10:33:13 ERROR Executor: Exception in task 0.0 in stage 200.0 (TID 4827)
java.lang.IndexOutOfBoundsException: toIndex = 3
        at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
        at java.util.ArrayList.subList(ArrayList.java:996)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
        at org.apache.spark.rdd.HadoopRDD$anon$1.liftedTree1$1(HadoopRDD.scala:252)
        at org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:251)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
1 ACCEPTED SOLUTION

avatar
Rising Star

I resolved it by removing the column on which the table was partitioned from the dataframe

View solution in original post

1 REPLY 1

avatar
Rising Star

I resolved it by removing the column on which the table was partitioned from the dataframe