Spark 2.1 Hive Partition Adding Issue ORC Format

Rising Star

I am using PySpark 2.1 to create partitions dynamically from table A to table B. Below are the DDLs:

create table A (
objid bigint,
occur_date timestamp)
STORED AS ORC;

create table B (
objid bigint,
occur_date timestamp)
PARTITIONED BY (
occur_date_pt date)
STORED AS ORC;

I am then using PySpark code to determine which partitions need to be merged. Below is the portion of the code that does this:

from pyspark.sql.functions import col

# For each distinct partition value in the incremental data, rebuild that
# partition: replace the rows superseded by the increment, write the result
# to the partition directory, and register it with the Hive metastore.
for row in incremental_df.select(partitioned_column).distinct().collect():
    path      = '/apps/hive/warehouse/B/' + partitioned_column + '=' + format(row[0])
    old_df    = merge_df.where(col(partitioned_column).isin(format(row[0])))
    new_df    = incremental_df.where(col(partitioned_column).isin(format(row[0])))
    output_df = old_df.subtract(new_df)
    output_df = output_df.unionAll(new_df)
    output_df.write.option("compression", "none").mode("overwrite").format("orc").save(path)
    refresh_metadata_sql = 'MSCK REPAIR TABLE ' + table_name
    sqlContext.sql(refresh_metadata_sql)

On execution of the code, I am able to see the partitions in HDFS:

Found 3 items
drwx------   - 307010265 hdfs    0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-01
drwx------   - 307010265 hdfs    0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-02
drwx------   - 307010265 hdfs    0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-03

But when I try to access the table from Spark, I get an index-out-of-bounds error:

>>> merge_df = sqlContext.sql('select * from B')
DataFrame[]
>>> merge_df.show()
17/06/01 10:33:13 ERROR Executor: Exception in task 0.0 in stage 200.0 (TID 4827)
java.lang.IndexOutOfBoundsException: toIndex = 3
        at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
        at java.util.ArrayList.subList(ArrayList.java:996)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
        at org.apache.spark.rdd.HadoopRDD$anon$1.liftedTree1$1(HadoopRDD.scala:252)
        at org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:251)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
1 ACCEPTED SOLUTION

Rising Star

I resolved it by removing the column on which the table was partitioned from the DataFrame before writing.
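
For reference, here is a minimal sketch of that fix applied to the merge loop above, reusing the question's variable names (the exact wiring is an assumption on my part). The likely cause: the ORC files written to the partition path still contained the partition column, so each file had three columns while Hive expected only the two non-partition columns (objid, occur_date); hence toIndex = 3 in getSchemaOnRead. Dropping the partition column before the write keeps the file schema aligned with the table, and the partition value is carried by the occur_date_pt=<value> directory name instead.

# Sketch of the fix (assumes the same variables as the loop above).
# Drop the partition column so the ORC files contain only the
# non-partition columns; Hive derives the partition value from the
# directory name.
output_df = old_df.subtract(new_df).unionAll(new_df)
output_df.drop(partitioned_column) \
    .write.option("compression", "none") \
    .mode("overwrite").format("orc").save(path)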
