Support Questions

vergari · ‎02-23-2016

Hi all,

I get a java.lang.IndexOutOfBoundsException when running a select distinct(...) on a large Hive table (about 60 GB).

This is the log of the Tez vertex:

2016-02-23 16:35:03,039 [ERROR] [TezChild] |tez.TezProcessor|: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.IndexOutOfBoundsException
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:326)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.lang.IndexOutOfBoundsException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
    ... 16 more
Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:567)
    at java.nio.ByteBuffer.get(ByteBuffer.java:686)
    at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:285)
    at org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:609)
    at org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:569)
    at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:737)
    at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:793)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:853)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
    at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
    at org.apache.hadoop.hive.ql.exec.Utilities.skipHeader(Utilities.java:3911)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:337)
    ... 22 more

I have already tried disabling vectorization and increasing the Tez container size, but nothing changed.

If I run the same query on the same table with less data in it, it completes without errors.

Has anyone seen this kind of error before?

Thank you,

D.

gopalv · ‎02-24-2016

Ignoring the actual backtrace (which is a bug), I have seen issues with uncompressed text tables in Tez related to Hive's use of Hadoop-1 APIs.

Try re-running with

set mapreduce.input.fileinputformat.split.minsize=67108864;

or, alternatively, compress the files with gzip before loading them, using something like this: https://gist.github.com/t3rmin4t0r/49e391eab4fbdfdc8ce1
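
For anyone trying this, a minimal sketch of how the setting might be applied in a Hive session before re-running the failing query. The table and column names (my_big_table, my_column) are placeholders, not taken from the original post; 67108864 bytes is 64 MiB.

```sql
-- Raise the minimum input split size to 64 MiB (64 * 1024 * 1024 = 67108864 bytes).
-- This applies only to the current session.
set mapreduce.input.fileinputformat.split.minsize=67108864;

-- Hypothetical table and column names; substitute your own.
select distinct(my_column) from my_big_table;
```

If the query then succeeds, the setting can be kept at the top of the script that runs it rather than changed cluster-wide.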

gopalv · ‎02-24-2016

This looks awfully like an HDFS bug and entirely unrelated to Tez. The IndexOutOfBoundsException is thrown from the HDFS local block reader (BlockReaderLocal in the trace).

lrea_sys · ‎04-04-2016

Hi Davide,

Setting

hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

is a good workaround.
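
A sketch of how this workaround might look in a session, assuming it is only wanted for the failing query. The table and column names are placeholders, and the value restored at the end is what I understand to be the usual default for this property; check your own configuration before relying on it.

```sql
-- Switch the Tez input format for this session only.
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Hypothetical table and column names; substitute your own.
select distinct(my_column) from my_big_table;

-- Assumed default; restore it so later queries in the session are unaffected.
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```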

😉 Bye

pravat_sutar01 · ‎09-29-2017

Setting sort merge join conversion to false worked fine for me:

hive.auto.convert.sortmerge.join=false
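
As a session-level sketch (not a cluster-wide change), the property can be inspected first and then overridden. Running set with only the property name prints its current value.

```sql
-- Print the current value of the property.
set hive.auto.convert.sortmerge.join;

-- Disable automatic conversion to sort-merge joins for this session.
set hive.auto.convert.sortmerge.join=false;
```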

--Pravat Sutar
