Hadoop read IO size

Rising Star

Hi dear experts!

I'm curious how to control the read IO size in my MR jobs.

For example, I have a file in HDFS; under the hood it is stored as block files on the Linux filesystem (/disk1/hadoop/.../.../blkXXX).

In the ideal case, each of these block files should be as large as the HDFS block size (128-256 MB).

My question is: how can I set the IO size for read operations?

Thank you!

1 ACCEPTED SOLUTION

Mentor
Start here, and drill further down into the DFSClient and DFSInputStream, etc. classes: https://github.com/cloudera/hadoop-common/blob/cdh5.4.5-release/hadoop-hdfs-project/hadoop-hdfs/src/...


5 REPLIES

Mentor
Jobs typically read records - not entire blocks. Is your MR job doing anything different in this regard?

Note that HDFS readers do not read whole blocks of data at a time; instead they stream the data via a buffered read (typically 64-128 KB). The fact that the block size is X MB does not translate into a memory requirement unless you explicitly store the entire block in memory while streaming the read.
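To make that concrete, here is a minimal sketch of a buffered read using the standard FileSystem API (the path is hypothetical and the 64 KB application buffer is just an example). It shows that only the small read buffer, not the whole 128-256 MB block, needs to be held in memory at any time:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BufferedHdfsRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/data.txt"); // hypothetical file

        byte[] buf = new byte[64 * 1024]; // application-level read size, not the block size
        long total = 0;
        try (FSDataInputStream in = fs.open(file)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n; // process buf[0..n) here; only ~64 KB is resident at once
            }
        }
        System.out.println("Read " + total + " bytes");
        fs.close();
    }
}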

Rising Star
Thank you for your reply!
Just to clarify:
> stream the data via a buffered read
Is the size of this buffer defined by the io.file.buffer.size parameter?

Thanks!

Mentor
The reader buffer size is indeed controlled by that property (io.file.buffer.size), but note that if you're doing short-circuit reads, another property also applies: dfs.client.read.shortcircuit.buffer.size (specified in bytes, 1 MB by default).
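In case it helps, here is a small sketch of setting those two properties programmatically on the client Configuration (they can equally be set in core-site.xml / hdfs-site.xml). The 128 KB value is only an illustration, not a recommendation:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ReadBufferSettings {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Buffer used for ordinary streaming reads, in bytes.
        conf.setInt("io.file.buffer.size", 131072); // e.g. 128 KB

        // Buffer used when short-circuit local reads are enabled, in bytes;
        // the stock default is 1 MB.
        conf.setInt("dfs.client.read.shortcircuit.buffer.size", 1024 * 1024);

        // A FileSystem created from this Configuration picks up both values.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("io.file.buffer.size = " + conf.getInt("io.file.buffer.size", -1));
        fs.close();
    }
}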

Rising Star
Thank you for your reply!
Could you point me to the source class where I can read about this in more detail?

Thanks!

Mentor
Start here, and drill further down into the DFSClient and DFSInputStream, etc. classes: https://github.com/cloudera/hadoop-common/blob/cdh5.4.5-release/hadoop-hdfs-project/hadoop-hdfs/src/...
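If you want a concrete entry point while reading that code, note that FileSystem.open() also has an overload that takes an explicit buffer size, which DistributedFileSystem forwards toward the DFSClient/DFSInputStream path linked above. A minimal sketch follows (the path is hypothetical, and how far the per-call value is honored depends on the HDFS version, so treat it as a tracing aid rather than a tuning knob):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenWithBufferSize {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // The second argument is the requested read buffer size in bytes (here 128 KB);
        // set a breakpoint in DFSClient.open() to trace where the value ends up.
        try (FSDataInputStream in = fs.open(new Path("/user/example/data.txt"), 128 * 1024)) {
            System.out.println("first byte: " + in.read());
        }
        fs.close();
    }
}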