Support Questions

fil · ‎09-08-2015

Hi dear experts!

I'm curious how to control the read I/O size in my MR jobs.

For example, I have a file in HDFS; under the hood it is stored as files in the Linux filesystem: /disk1/hadoop/.../.../blkXXX.

In the ideal case, each of these files should be equal to the block size (128-256 MB).

My question is: how can I set the I/O size for read operations?

thank you!

Harsh J · ‎09-08-2015

Jobs typically read records - not entire blocks. Is your MR job doing anything different in this regard?

Note that HDFS Readers do not read whole blocks of data at a time, and instead stream the data via a buffered read (64k-128k typically). That the block size is X MB does not translate into a memory requirement unless you are explicitly storing the entire block in memory when streaming the read.
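To make the point about memory concrete, here is a minimal sketch in plain Java (java.io only, not the actual HDFS client code). It streams a large input through one small, fixed-size buffer. The class name BufferedReadSketch, the method streamThrough, and the 8 MB in-memory "block" are all hypothetical stand-ins chosen for illustration; the 64 KB buffer simply mirrors the typical range mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadSketch {

    // Streams the whole input through a single fixed-size buffer and
    // returns the number of bytes seen. Memory use is bounded by
    // bufferSize, no matter how large the input is.
    static long streamThrough(InputStream in, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        // read() fills at most buffer.length bytes per call; -1 means end of stream.
        while ((n = in.read(buffer)) != -1) {
            total += n; // a real job would parse records out of buffer[0..n) here
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a block file; a real reader would open a stream instead.
        byte[] fakeBlock = new byte[8 * 1024 * 1024];
        long total = streamThrough(new ByteArrayInputStream(fakeBlock), 64 * 1024);
        System.out.println("bytes streamed: " + total);
    }
}
```

The only per-reader allocation in the loop is the buffer itself, which is why the block size alone does not become a memory requirement.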

fil · ‎09-09-2015

Thank you for your reply!
Just to clarify:
> stream the data via a buffered read
Is the size of this buffer defined by the io.file.buffer.size parameter?

thanks!

Harsh J · ‎09-09-2015

The reader buffer size is indeed controlled by that property (io.file.buffer.size). Note, however, that if you're doing short-circuit reads, another property also applies: dfs.client.read.shortcircuit.buffer.size (1 MB by default, specified in bytes).
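As a rough illustration, the two properties could be set in the client-side configuration like this. This is a sketch, not a tuning recommendation: the 131072 value is only an example taken from the 64k-128k range mentioned earlier, and the placement assumes the usual convention that io.file.buffer.size lives in core-site.xml and the dfs.* property in hdfs-site.xml.

```xml
<!-- core-site.xml: general stream buffer size, in bytes (example value: 128 KB) -->
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>

<!-- hdfs-site.xml: buffer used for short-circuit reads, in bytes (1 MB is the default) -->
<property>
  <name>dfs.client.read.shortcircuit.buffer.size</name>
  <value>1048576</value>
</property>
```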

fil · ‎09-09-2015

Thank you for your reply!
Could you point me to the source class where I can read about this in more detail?

thanks!

Harsh J · ‎09-09-2015

Start here, and drill further down into the DFSClient and DFSInputStream, etc. classes: https://github.com/cloudera/hadoop-common/blob/cdh5.4.5-release/hadoop-hdfs-project/hadoop-hdfs/src/...

Cloudera Community

Hadoop read IO size