Spark 2 -- Reading ORC files using Spark session throws IndexOutOfBoundsException

Contributor

Hi, good evening.

I am trying to execute the following commands to load an ORC file into a DataFrame:

$ sudo -u spark ./spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/10/22 04:19:49 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://
Spark context available as 'sc' (master = local[*], app id = local-1477109984751).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0.2.5.0.0-1245
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.SparkSession

scala> val spark = SparkSession.builder().appName("SparkSessionOrcExample").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
16/10/22 04:23:14 WARN SparkSession$Builder: Use an existing SparkSession, some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@41ec4271

scala> val tblaccount = spark.read.orc("/apps/hive/warehouse/enrollment_full.db/account")
java.lang.IndexOutOfBoundsException
  at java.nio.Buffer.checkIndex(Buffer.java:540)
  at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
  at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:374)
  at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:316)
  at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:187)
  at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$getFileReader$2.apply(OrcFileOperator.scala:68)
  at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$getFileReader$2.apply(OrcFileOperator.scala:67)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.TraversableOnce$class.collectFirst(TraversableOnce.scala:145)
  at scala.collection.AbstractIterator.collectFirst(Iterator.scala:1336)
  at org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:69)
  at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
  at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  at org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
  at org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:61)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:392)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:392)
  at scala.Option.orElse(Option.scala:289)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:391)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:450)
  at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:439)
  ... 48 elided

The account table was imported with Sqoop in ORC format. Reading works fine in one case (the error table), whereas it fails in the other cases with the above exception.

The following is the directory structure

$ hadoop fs -ls /apps/hive/warehouse/enrollment_full.db/account
Found 5 items
-rw-rw-rw-  3 mysqoop hdfs  0 2016-10-21 22:27 /apps/hive/warehouse/enrollment_full.db/account/part-m-00000
-rw-rw-rw-  3 mysqoop hdfs  385758 2016-10-21 22:27 /apps/hive/warehouse/enrollment_full.db/account/part-m-00001
-rw-rw-rw-  3 mysqoop hdfs  4285 2016-10-21 22:27 /apps/hive/warehouse/enrollment_full.db/account/part-m-00002
-rw-rw-rw-  3 mysqoop hdfs  3499 2016-10-21 22:27 /apps/hive/warehouse/enrollment_full.db/account/part-m-00003
-rw-rw-rw-  3 mysqoop hdfs  4226 2016-10-21 22:27 /apps/hive/warehouse/enrollment_full.db/account/part-m-00004

I am able to read the error table using

val tblerror = spark.read.orc("/apps/hive/warehouse/enrollment_full.db/error")

whereas the account table gives the error above.

$ hadoop fs -ls /apps/hive/warehouse/enrollment_full.db/error
Found 4 items
-rw-rw-rw-  3 mysqoop hdfs  10094063 2016-10-07 22:04 /apps/hive/warehouse/enrollment_full.db/error/part-m-00000
-rw-rw-rw-  3 mysqoop hdfs  3781085 2016-10-07 22:04 /apps/hive/warehouse/enrollment_full.db/error/part-m-00001
-rw-rw-rw-  3 mysqoop hdfs  4333343 2016-10-07 22:04 /apps/hive/warehouse/enrollment_full.db/error/part-m-00002
-rw-rw-rw-  3 mysqoop hdfs  6345381 2016-10-07 22:04 /apps/hive/warehouse/enrollment_full.db/error/part-m-00003

Any help or pointer to fix this issue is much appreciated.

I suspect the issue is part-m-00000 under account, as this file has zero bytes.

My question is: how can we avoid creating zero-byte files during the Sqoop import, and is there a flag in Spark to ignore these zero-byte files while reading the ORC data?
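One possible workaround on the Spark side, a minimal sketch (illustrative only, not a verified fix): list the part files with the Hadoop FileSystem API, keep only the non-empty ones, and pass those paths to the ORC reader so the zero-byte file is never opened. Only the directory path below comes from my setup; the variable names are just for illustration.

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: skip zero-byte part files before handing the paths to spark.read.orc.
val accountDir = new Path("/apps/hive/warehouse/enrollment_full.db/account")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Keep only regular files with a length greater than zero (drops part-m-00000 here).
val nonEmptyFiles = fs.listStatus(accountDir)
  .filter(status => status.isFile && status.getLen > 0)
  .map(_.getPath.toString)

// DataFrameReader.orc accepts multiple paths, so the empty file is simply never read.
val tblaccount = spark.read.orc(nonEmptyFiles: _*)
tblaccount.printSchema()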

Thanks

Ram

2 REPLIES

Re: Spark 2 -- Reading ORC files using Spark session throws IndexOutOfBoundsException

Super Guru

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/orc-spark.h...

Why don't you create a Hive table on top of the ORC files and see if you can query it without errors?

http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/

If you store it in the Hive warehouse area, there had better be a table on top of it:

/apps/hive/warehouse/enrollment_full.db/account

So just use Spark SQL to read that table; it is faster and easier to work with in Spark anyway. Spark SQL is great.
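A minimal sketch of that approach (the table name and column list below are placeholders; adjust them to the real schema produced by the Sqoop import): define an external ORC table over the existing directory and query it through Spark SQL.

// Hypothetical DDL: the columns must match the data actually written by Sqoop.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS enrollment_full.account_orc (
    account_id INT,
    account_name STRING
  )
  STORED AS ORC
  LOCATION '/apps/hive/warehouse/enrollment_full.db/account'
""")

// Read through the metastore instead of pointing spark.read.orc at raw files.
val accounts = spark.sql("SELECT * FROM enrollment_full.account_orc")
accounts.show(10)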


Re: Spark 2 -- Reading ORC files using Spark session throws IndexOutOfBoundsException

New Contributor

I have the same problem, even when I use Hive tables. The problem doesn't exist in Hive itself. I found a JIRA issue for this: https://issues.apache.org/jira/browse/SPARK-19809
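One thing that may be worth trying, as a hedged sketch rather than a confirmed fix (I have not verified it against this exact stack trace): newer Spark releases have a setting to skip files they cannot read. Removing the zero-byte part file from HDFS remains the surer workaround.

// Assumption: spark.sql.files.ignoreCorruptFiles is available from Spark 2.1 onward; whether it
// covers the ORC schema-inference path that fails here is unverified.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val tblaccount = spark.read.orc("/apps/hive/warehouse/enrollment_full.db/account")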
