<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hadoop read IO size in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31646#M7240</link>
    <description>Thank you for your reply!&lt;BR /&gt;Just to clarify:&lt;BR /&gt;&amp;gt; stream the data via a buffered read&lt;BR /&gt;Is the size of this buffer defined by the io.file.buffer.size parameter?&lt;BR /&gt;&lt;BR /&gt;Thanks!</description>
    <pubDate>Wed, 09 Sep 2015 17:01:49 GMT</pubDate>
    <dc:creator>fil</dc:creator>
    <dc:date>2015-09-09T17:01:49Z</dc:date>
    <item>
      <title>Hadoop read IO size</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31628#M7238</link>
      <description>&lt;P&gt;Hi dear experts!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm curious how to control the read IO size in my MR jobs.&lt;/P&gt;&lt;P&gt;For example, I have a file in HDFS; under the hood it is stored as block files in the Linux filesystem, e.g. /disk1/hadoop/.../.../blkXXX.&lt;/P&gt;&lt;P&gt;In the ideal case each block file should be equal to the block size (128-256 MB).&lt;/P&gt;&lt;P&gt;My question is: how can I set the IO size for read operations?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:40:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31628#M7238</guid>
      <dc:creator>fil</dc:creator>
      <dc:date>2022-09-16T09:40:13Z</dc:date>
    </item>
    <item>
      <title>Re: Hadoop read IO size</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31630#M7239</link>
      <description>Jobs typically read records, not entire blocks. Is your MR job doing anything different in this regard?&lt;BR /&gt;&lt;BR /&gt;Note that HDFS readers do not read whole blocks of data at a time; instead, they stream the data via a buffered read (typically 64-128 KB). The fact that the block size is X MB does not translate into a memory requirement unless you explicitly store the entire block in memory while streaming the read.</description>
      <pubDate>Wed, 09 Sep 2015 04:28:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31630#M7239</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2015-09-09T04:28:31Z</dc:date>
    </item>
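    <!--
    Editor's note: the buffered-read behaviour described in the reply above can
    be sketched with plain JDK streams. This is an analogy only, not the actual
    HDFS client (the real classes are DFSClient and DFSInputStream); the 64 KB
    buffer and 1 MB file size below are illustrative assumptions.

    ```java
    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class BufferedHdfsStyleRead {
        // Streams a file through a small fixed buffer: memory use is bounded
        // by bufferSize, not by the size of the file (or the HDFS block).
        static long readWithBuffer(Path file, int bufferSize) throws IOException {
            long total = 0;
            byte[] buf = new byte[bufferSize];
            try (InputStream in = new BufferedInputStream(Files.newInputStream(file), bufferSize)) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n; // at most bufferSize bytes are held at a time
                }
            }
            return total;
        }

        public static void main(String[] args) throws IOException {
            // Create a 1 MB temp file (standing in for an HDFS block file)
            // and stream it through a 64 KB buffer.
            Path tmp = Files.createTempFile("block", ".dat");
            Files.write(tmp, new byte[1024 * 1024]);
            long total = readWithBuffer(tmp, 64 * 1024);
            System.out.println("bytes read: " + total); // prints "bytes read: 1048576"
            Files.delete(tmp);
        }
    }
    ```

    The point mirrors the reply: the whole 1 MB file is consumed, but only a
    64 KB buffer is ever resident, regardless of block size.
    -->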
    <item>
      <title>Re: Hadoop read IO size</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31646#M7240</link>
      <description>Thank you for your reply!&lt;BR /&gt;Just to clarify:&lt;BR /&gt;&amp;gt; stream the data via a buffered read&lt;BR /&gt;Is the size of this buffer defined by the io.file.buffer.size parameter?&lt;BR /&gt;&lt;BR /&gt;Thanks!</description>
      <pubDate>Wed, 09 Sep 2015 17:01:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31646#M7240</guid>
      <dc:creator>fil</dc:creator>
      <dc:date>2015-09-09T17:01:49Z</dc:date>
    </item>
    <item>
      <title>Re: Hadoop read IO size</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31660#M7241</link>
      <description>The reader buffer size is indeed controlled by that property (io.file.buffer.size), but note that if you are doing short-circuit reads, another property also applies: dfs.client.read.shortcircuit.buffer.size (1 MB, in bytes, by default).&lt;BR /&gt;</description>
      <pubDate>Wed, 09 Sep 2015 23:44:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31660#M7241</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2015-09-09T23:44:20Z</dc:date>
    </item>
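    <!--
    Editor's note: a minimal configuration sketch based on the two properties
    named in the reply above. The values shown (64 KB and 1 MB) are taken from
    this thread, not verified defaults, and may differ across Hadoop/CDH
    versions.

    ```xml
    In core-site.xml (client-side buffered stream reads):
    <property>
      <name>io.file.buffer.size</name>
      <value>65536</value>
    </property>

    In hdfs-site.xml (short-circuit local reads):
    <property>
      <name>dfs.client.read.shortcircuit.buffer.size</name>
      <value>1048576</value>
    </property>
    ```

    Both values are in bytes.
    -->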
    <item>
      <title>Re: Hadoop read IO size</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31663#M7242</link>
      <description>Thank you for your reply!&lt;BR /&gt;Could you point me to the source class where I can read about this in more detail?&lt;BR /&gt;&lt;BR /&gt;Thanks!</description>
      <pubDate>Thu, 10 Sep 2015 01:03:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31663#M7242</guid>
      <dc:creator>fil</dc:creator>
      <dc:date>2015-09-10T01:03:40Z</dc:date>
    </item>
    <item>
      <title>Re: Hadoop read IO size</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31669#M7243</link>
      <description>Start here, and drill further down into the DFSClient and DFSInputStream, etc. classes: &lt;A href="https://github.com/cloudera/hadoop-common/blob/cdh5.4.5-release/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L294-L303" target="_blank"&gt;https://github.com/cloudera/hadoop-common/blob/cdh5.4.5-release/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L294-L303&lt;/A&gt;</description>
      <pubDate>Thu, 10 Sep 2015 05:36:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hadoop-read-IO-size/m-p/31669#M7243</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2015-09-10T05:36:18Z</dc:date>
    </item>
  </channel>
</rss>

