Hadoop read IO size

Rising Star

Hi dear experts!

I'm curious how I can control the read I/O size in my MR jobs.

For example, I have a file in HDFS; under the hood it is stored as files in the Linux filesystem: /disk1/hadoop/.../.../blkXXX.

Ideally, each of these files should be the same size as the HDFS block (128-256 MB).

My question is: how can I set the I/O size for read operations?

Thank you!


Re: Hadoop read IO size

Master Guru
Jobs typically read records - not entire blocks. Is your MR job doing anything different in this regard?

Note that HDFS Readers do not read whole blocks of data at a time, and instead stream the data via a buffered read (64k-128k typically). That the block size is X MB does not translate into a memory requirement unless you are explicitly storing the entire block in memory when streaming the read.
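To illustrate the point about buffered streaming, here is a minimal sketch in plain `java.io`. It is not the HDFS client code: the class `ChunkedReadSketch`, the helper `drain`, and the 64 KiB buffer are assumptions chosen for illustration, and an in-memory byte array stands in for block data. It only shows that a reader which pulls data through a fixed-size buffer never holds more than that buffer in memory, however large the source is.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustration only: plain java.io, not the actual HDFS reader implementation.
class ChunkedReadSketch {

    /** Counters collected while draining a stream. */
    static final class ReadStats {
        final long totalBytes; // bytes consumed overall
        final int readCalls;   // number of read() calls that returned data
        final int maxChunk;    // largest number of bytes returned by one read()

        ReadStats(long totalBytes, int readCalls, int maxChunk) {
            this.totalBytes = totalBytes;
            this.readCalls = readCalls;
            this.maxChunk = maxChunk;
        }
    }

    /** Reads the stream to the end through a single buffer of bufferSize bytes. */
    static ReadStats drain(InputStream in, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize]; // the only data buffer allocated
        long total = 0;
        int calls = 0;
        int maxChunk = 0;
        try {
            int n;
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                total += n;
                calls++;
                maxChunk = Math.max(maxChunk, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ReadStats(total, calls, maxChunk);
    }

    public static void main(String[] args) {
        // 1 MiB of zeros stands in for (part of) a much larger block.
        InputStream in = new ByteArrayInputStream(new byte[1024 * 1024]);
        ReadStats s = drain(in, 64 * 1024);
        System.out.println(s.totalBytes + " bytes in " + s.readCalls
                + " reads, largest read " + s.maxChunk + " bytes");
    }
}
```

With the in-memory source used here, the 1 MiB input is consumed in 16 reads of 64 KiB each; a real network or disk stream may return fewer bytes per call, but never more than the buffer size.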

Re: Hadoop read IO size

Rising Star
Thank you for your reply!
Just to clarify:
> stream the data via a buffered read
Is the size of this buffer defined by the io.file.buffer.size parameter?

Thanks!

Re: Hadoop read IO size

Master Guru
The reader buffer size is indeed controlled by that property (io.file.buffer.size), but note that if you're doing short-circuit reads, another property also applies: dfs.client.read.shortcircuit.buffer.size (1 MB by default, specified in bytes).
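Since both properties take their values in bytes, a small sketch of the arithmetic may help. The class `BufferSizeOptionsSketch` and its helpers are hypothetical and use only the Java standard library; they render the two property names from this thread as `-D key=value` overrides. Passing such overrides on the command line assumes the job driver parses Hadoop's generic options (for example, by running through `ToolRunner`). The 128 KiB value is just an example taken from the range mentioned earlier in the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: formats buffer sizes as -D overrides.
class BufferSizeOptionsSketch {
    // Property names as given in this thread.
    static final String IO_BUFFER_KEY = "io.file.buffer.size";
    static final String SHORT_CIRCUIT_BUFFER_KEY = "dfs.client.read.shortcircuit.buffer.size";

    /** Converts KiB to bytes, the unit both properties expect. */
    static int kib(int n) {
        return n * 1024;
    }

    /** Renders each entry as "-D key=value", joined by single spaces. */
    static String toGenericOptions(Map<String, Integer> sizesInBytes) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sizesInBytes.entrySet()) {
            parts.add("-D " + e.getKey() + "=" + e.getValue());
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        sizes.put(IO_BUFFER_KEY, kib(128));             // example value: 128 KiB
        sizes.put(SHORT_CIRCUIT_BUFFER_KEY, kib(1024)); // 1 MB, the default cited above
        System.out.println(toGenericOptions(sizes));
    }
}
```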

Re: Hadoop read IO size

Rising Star
Thank you for your reply!
Could you point me to the source class where I can read about this in more detail?

Thanks!

Re: Hadoop read IO size

Master Guru
Start here, and drill further down into the DFSClient and DFSInputStream, etc. classes: https://github.com/cloudera/hadoop-common/blob/cdh5.4.5-release/hadoop-hdfs-project/hadoop-hdfs/src/...
