Archives of Support Questions (Read Only)

jayadeep_jayara · ‎06-01-2017

I am using pyspark 2.1 to create partitions dynamically from table A to table B. Below are the DDL's

<code>create table A (
objid bigint,
occur_date timestamp)
STORED AS ORC;

create table B (
objid bigint,
occur_date timestamp)
PARTITIONED BY (
occur_date_pt date)
STORED AS ORC;

I am then using a pyspark code where I am trying to determine the partitions that need to be merged, below is the portion of code where I am actually doing that

<code>for row in  incremental_df.select(partitioned_column).distinct().collect():
    path            = '/apps/hive/warehouse/B/' + partitioned_column + '=' + format(row[0])
    old_df          = merge_df.where(col(partitioned_column).isin(format(row[0])))
    new_df          = incremental_df.where(col(partitioned_column).isin(format(row[0])))
    output_df       = old_df.subtract(new_df)
    output_df       = output_df.unionAll(new_df)
    output_df.write.option("compression","none").mode("overwrite").format("orc").save(path)
    refresh_metadata_sql = 'MSCK REPAIR TABLE ' + table_name
    sqlContext.sql(refresh_metadata_sql)

On Execution of the code I am able to see the partitions in HDFS

Found 3 items drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-01 drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-02 drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-03

But when I am trying to access the table inside Spark I am getting array out of bound error

<code>>> merge_df = sqlContext.sql('select * from B')
DataFrame[]
>>> merge_df.show()
17/06/01 10:33:13 ERROR Executor: Exception in task 0.0 in stage 200.0 (TID 4827)
java.lang.IndexOutOfBoundsException: toIndex = 3
        at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
        at java.util.ArrayList.subList(ArrayList.java:996)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
        at org.apache.spark.rdd.HadoopRDD$anon$1.liftedTree1$1(HadoopRDD.scala:252)
        at org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:251)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

jayadeep_jayara · ‎06-12-2017

I resolved it by removing the column on which the table was partitioned from the dataframe

View solution in original post

jayadeep_jayara · ‎06-12-2017

I resolved it by removing the column on which the table was partitioned from the dataframe

Cloudera Community

Archives of Support Questions (Read Only)

Spark 2.1 Hive Partition Adding Issue ORC Format