
Impala's log "ORC read request to already read range" messages

Explorer

Hi All

Under CDP 7.3.1, our Impala logs are being flooded with messages like

ORC read request to already read range. Falling back to readRandom. offset: 16248424 length: 105200 colrange_offset: 15988393 colrange_length: 365231 colrange_pos: 16353624 typeId: 94 kind: data filename: .......

from queries reading ACID ORC tables created in Hive.

We've searched for information on this, and the only thing we've found is the code that emits the message:

void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, uint64_t length,
    uint64_t offset) {
  Status status;
  if (scanner_->IsInFooterRange(offset, length)) {
    // Requests inside the file footer range are served by the footer stream.
    status = scanner_->ReadFooterStream(buf, length, offset);
  } else {
    ColumnRange* columnRange = scanner_->FindColumnRange(length, offset);
    if (columnRange == nullptr) {
      // No tracked column range covers this request: fall back to a
      // positioned (random) read.
      status = readRandom(buf, length, offset);
    } else if (offset < columnRange->current_position_) {
      // The sequential stream for this column range has already advanced past
      // the requested offset, so the request cannot be served from the
      // read-ahead stream. Log it and fall back to a random read.
      VLOG_QUERY << Substitute(
          "ORC read request to already read range. Falling back to readRandom. "
          "offset: $0 length: $1 $2",
          offset, length, columnRange->debug());
      status = readRandom(buf, length, offset);
    } else {
      // Normal case: serve the request sequentially from the column range.
      status = columnRange->read(buf, length, offset);
    }
  }
  if (!status.ok()) throw ResourceError(status);
}

I'd like to know whether these messages have any performance implications (they are logged at INFO level) and, if they do, what we can do to resolve this.

Thanks in advance for your help and time.

Regards.

1 REPLY

Master Collaborator

@AEAT The log message "ORC read request to already read range. Falling back to readRandom" is a sign of a suboptimal read pattern. While not a fatal error, it means Impala is not reading the ORC file as efficiently as it could.

Impala's ORC scanner is designed to read data in a sequential, read-ahead fashion to optimize I/O from HDFS. It attempts to predict what data a query will need next and reads it in large, efficient chunks.

-> Random reads are slower than sequential reads on both spinning disks and SSDs.

-> Seeking to a different location in the file and then reading a small chunk of data adds CPU and I/O overhead to every request.

-> The cumulative effect of these inefficient reads can add significant time to a query's execution, especially for large datasets.

 

The most common cause of this issue is a large number of small files. For ACID tables, the many small delta files left behind by frequent inserts and updates have the same effect. Impala has to issue a separate set of I/O requests for each file, which disrupts the efficient sequential read pattern. Please check whether your tables consist of many small files and, if so, compact them into files sized close to the HDFS block size, as sketched below.
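
For example (my_db.my_table below is a placeholder), you can inspect the table's file layout from Impala and, since these are ACID tables created in Hive, trigger a Hive major compaction to merge the small base and delta files:

-- In impala-shell: list the data files backing the table to check for many small files
SHOW FILES IN my_db.my_table;

-- In Hive (e.g. beeline): run a major compaction to merge the base and delta files
ALTER TABLE my_db.my_table COMPACT 'major';

-- Track the compaction's progress until it completes
SHOW COMPACTIONS;

-- Back in impala-shell: pick up the compacted file layout
REFRESH my_db.my_table;

A major compaction rewrites the base and delta files into a single base per partition, which restores the large sequential reads the ORC scanner expects.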

Also monitor resource usage while the query runs so you can quantify the impact.
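
For instance, in impala-shell you can fetch the profile of the query you just ran and check the HDFS scan node's I/O counters (a sketch; the exact counter names can vary between Impala versions):

-- In impala-shell, immediately after running the affected query:
PROFILE;
-- In the output, find the HDFS_SCAN_NODE section and compare counters such as
-- BytesRead and TotalRawHdfsReadTime to see how much time the scan spends on I/O.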

 

Regards,

Chethan YM