Tez IndexOutOfBoundsException in select distinct

Expert Contributor

Hi all,

I get a java.lang.IndexOutOfBoundsException while trying to execute a select distinct(...) on a large Hive table (about 60 GB).

This is the log of the Tez vertex:

2016-02-23 16:35:03,039 [ERROR] [TezChild] |tez.TezProcessor|: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.IndexOutOfBoundsException
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:326)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.lang.IndexOutOfBoundsException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
    ... 16 more
Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:567)
    at java.nio.ByteBuffer.get(ByteBuffer.java:686)
    at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:285)
    at org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:609)
    at org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:569)
    at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:737)
    at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:793)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:853)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
    at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
    at org.apache.hadoop.hive.ql.exec.Utilities.skipHeader(Utilities.java:3911)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:337)
    ... 22 more

I have already tried disabling vectorization and increasing the Tez container size, but nothing changed.

If I run the same query on the same table with less data in it, it succeeds.

Have you seen this kind of error before?

Thank you,

D.

1 ACCEPTED SOLUTION

Expert Contributor

Ignoring the actual backtrace (which is a bug), I have seen issues with uncompressed text tables in Tez related to Hive's use of Hadoop-1 APIs.

Try re-running with

set mapreduce.input.fileinputformat.split.minsize=67108864;

or, alternatively, compress the files with gzip before loading them, with something like this: https://gist.github.com/t3rmin4t0r/49e391eab4fbdfdc8ce1
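
For anyone trying the split-size suggestion, here is a minimal sketch of how it might look in a Hive session. The table and column names (big_table, some_col) are hypothetical placeholders, not taken from the original post; only the property name and value come from the answer above.

```sql
-- Sketch only: big_table and some_col are hypothetical names, not from the original post.
-- 67108864 bytes = 64 * 1024 * 1024, i.e. a 64 MiB minimum split size.
set mapreduce.input.fileinputformat.split.minsize=67108864;

-- Re-run the failing query in the same session so the setting applies to it.
select distinct(some_col) from big_table;
```

A set statement like this only affects the current session, so it can be tried without changing the cluster-wide configuration.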

REPLIES

Expert Contributor

This looks a lot like an HDFS bug, entirely unrelated to Tez. The IndexOutOfBoundsException is thrown from the HDFS local block reader (BlockReaderLocal in the trace).

New Contributor

Hi Davide,

Setting hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat is a good workaround.

😉 Bye

Setting sort merge join to false worked fine for me:

set hive.auto.convert.sortmerge.join=false;

--Pravat Sutar