Write performance in HDFS

New Contributor

Hi,

I am new to the Hadoop ecosystem, so if my questions sound trivial, please forgive my ignorance.

1. I read that HDFS has a default replication factor of 3, meaning that whenever a file is written, each block of the file is stored 3 times. Writing the same block 3 times needs more I/O than writing it once. How does Hadoop address this? Won't it be a problem when writing large datasets?

2. As HDFS is a virtual filesystem, the data it stores is ultimately stored on the underlying operating system (in most cases, Linux). Assuming Linux uses the ext3 filesystem (whose block size is 4 KB), how does having a 64 MB/128 MB block size in HDFS help? Will the 64 MB HDFS block be split into x * 4 KB blocks of the underlying operating system?

Thanks,

RK.

1 ACCEPTED SOLUTION

Master Guru

"1. I read that HDFS has a default replication factor 3, meaning whenever a file is written, each block of the file is stored 3 times. Writing same block 3 times needs more I/O compared to writing once. How does Hadoop address this ? won't it be a problem when writing large datasets ?"

In addition to what Kuldeep said:

It is true that Hadoop does 3 times the I/O in total compared to writing a file once to a single file system. However, a typical small cluster (10 datanodes, each with 12 disks) has 120 disks to write to, and therefore roughly 40x the write capacity (120 disks divided by 3 replicas) and 120x the read capacity of a single disk. So even on a small cluster, the I/O throughput of HDFS is enormous.
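To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Java. The per-disk throughput of 100 MB/s is an assumed, illustrative figure, not something stated above:

    // Back-of-the-envelope HDFS throughput estimate for a small cluster.
    // Assumption: ~100 MB/s sustained throughput per spinning disk (illustrative only).
    public class HdfsThroughputEstimate {
        public static void main(String[] args) {
            int datanodes = 10;
            int disksPerNode = 12;
            int replication = 3;
            double diskMBps = 100.0;          // assumed per-disk throughput

            int totalDisks = datanodes * disksPerNode;               // 120 disks
            double readMBps  = totalDisks * diskMBps;                // all disks can serve reads
            double writeMBps = totalDisks * diskMBps / replication;  // every byte is written 3 times

            System.out.printf("Aggregate read throughput : ~%.0f MB/s (%dx one disk)%n",
                    readMBps, totalDisks);
            System.out.printf("Aggregate write throughput: ~%.0f MB/s (~%dx one disk)%n",
                    writeMBps, totalDisks / replication);
        }
    }

With these assumptions the cluster still writes around 4,000 MB/s in aggregate despite the 3x replication overhead.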

"2. As HDFS is a Virtual Filesystem, the data it stores will ultimately stored on Underlying Operating system (Most cases, Linux). Assuming Linux has Ext3 File system (whose block size is 4KB), How does having 64MB/128 MB Block size in HDFS help ? Does the 64 MB Block in HDFS will be split into x * 4 KB Blocks of underlying Opertating system ?"

Again, a small additional comment to what Kuldeep said. A block is just a Linux file, so yes, all the underlying details of ext3 (or whatever local filesystem is used) still apply. The reason blocks are so big is not the storage on the local node but the central filesystem metadata:

- To have a distributed filesystem you need a central FS image that keeps track of ALL the blocks on ALL the servers in the system. In HDFS this is the NameNode. It keeps track of all the files in HDFS and all the blocks on all the datanodes. This metadata needs to be in memory to support the high number of concurrent tasks and operations happening in a Hadoop cluster, so the traditional ways of setting this up (B-trees on disk) don't really work. It also needs to cover all nodes and all disks, so the NameNode may have to track thousands of nodes with tens of thousands of drives. Doing that with 4 KB blocks wouldn't scale, which is why the block size is around 128 MB on most systems. (You can count roughly 1 GB of NameNode memory per 100 TB of data; a quick back-of-the-envelope calculation follows after these two points.)

- If you want to process a huge file in HDFS you need to run a parallel job on it (MapReduce, Tez, Spark, ...). In this case each task gets one block of data and reads it, which may be local to the task or not. Reading a big 128 MB block, or sending it over the network, is efficient. Doing the same with 30,000 blocks of 4 KB each would be very inefficient.

That is the reason for the block size.
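As a rough illustration of the first point, the sketch below counts how many block records the NameNode would have to track for 100 TB of data at a 4 KB versus a 128 MB block size. The ~150 bytes of heap per block record is only a commonly quoted rule of thumb, used here as an assumption:

    // Rough NameNode metadata estimate: block records needed for 100 TB of data.
    // Assumption: ~150 bytes of NameNode heap per block record (rule of thumb, not a measured value).
    public class NamenodeBlockCount {
        public static void main(String[] args) {
            double dataBytes = 100e12;                 // 100 TB of user data
            double smallBlock = 4 * 1024.0;            // 4 KB, like ext3
            double hdfsBlock  = 128 * 1024 * 1024.0;   // 128 MB HDFS default
            double bytesPerBlockRecord = 150.0;        // assumed heap cost per tracked block

            double blocks4k   = dataBytes / smallBlock;   // roughly 24 billion records
            double blocks128m = dataBytes / hdfsBlock;    // under a million records

            System.out.printf("4 KB blocks  : %.1e records (~%.0f GB of heap just for block records)%n",
                    blocks4k, blocks4k * bytesPerBlockRecord / 1e9);
            System.out.printf("128 MB blocks: %.1e records (~%.1f GB of heap)%n",
                    blocks128m, blocks128m * bytesPerBlockRecord / 1e9);
        }
    }

Block records are only part of what the NameNode keeps in memory (file, directory, and replica information add more on top), which is why the practical rule of thumb quoted above lands closer to 1 GB per 100 TB.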

4 REPLIES

Master Guru
@R K

Good question!

1. Regarding I/O: a block doesn't get written to a single node only; it is replicated to 2 other datanodes. The NameNode doesn't take responsibility for the replication itself (to avoid extra overhead); instead it hands out the locations of the other 2 datanodes, and each datanode continues this chain of replication in a pipeline, with an acknowledgement for each replica written. Read more about the anatomy of an HDFS write here. (The short FileSystem API sketch after this list shows how block and replica information is exposed to clients.)

2. Hadoop is designed to process big data, hence having small files won't give us much benefit. It's correct that the ext3 filesystem on Linux has a block size of 4 KB. When a MapReduce program, the Hadoop client, or any other application tries to read a file from HDFS, the block is the basic unit of access.

Regarding "Does the 64 MB Block in HDFS will be split into x * 4 KB Blocks of underlying Opertating system ?"

Logically speaking, 1 HDFS block corresponds to 1 file in the local filesystem on the datanode, so the underlying FS stores that file in its own 4 KB blocks; if you run the stat command on that file, you will get the block-related info from the underlying FS.
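A minimal sketch of how a client can see this block and replica information, using the standard Hadoop FileSystem API (the file path is hypothetical, and the cluster configuration is assumed to be on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Inspect how HDFS has split a file into blocks and where the replicas live.
    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/rk/bigfile.dat");   // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            System.out.println("Length      : " + status.getLen());
            System.out.println("Block size  : " + status.getBlockSize());   // e.g. 134217728 (128 MB)
            System.out.println("Replication : " + status.getReplication()); // e.g. 3

            // One BlockLocation per HDFS block; getHosts() lists the datanodes holding replicas.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + loc.getOffset()
                        + " length=" + loc.getLength()
                        + " hosts=" + String.join(",", loc.getHosts()));
            }
            fs.close();
        }
    }

The replication factor can also be changed per existing file with FileSystem.setReplication(path, (short) n), so the cluster-wide default of 3 is only a default, not a fixed property.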

And just a few more relevant facts:

1. The write pipeline for replication is parallelized in chunks, so the time to write an HDFS block with 3x replication is NOT 3x (the write time on one datanode), but rather 1x (the write time on one datanode) + 2x (delta), where "delta" is approximately the time to transmit and write one chunk. Where a block is 128 or 256 MB, a chunk is something like 64 KB, if I recall correctly. If your network between datanodes is at least 1 Gbps, the time for delta is dominated by the disk write speed. (A small numeric sketch of this follows after these two points.)

2. The last block of an HDFS file is typically a "short" block, since files aren't exact multiples of 128 MB. HDFS only takes up as much of the native filesystem storage as needed (quantized by the native filesystem block size, typically 4 KB on ext3/ext4), and does NOT take up the full 128 MB block size for the final block of each file.
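As a concrete illustration of both points, here is a tiny sketch. The 300 MB file size, 128 MB block size, 64 KB chunk size, and 100 MB/s disk write speed are all assumed values for the arithmetic, not measurements:

    // Illustration: block layout of a file, plus a crude pipelined-write time model.
    public class WriteModel {
        public static void main(String[] args) {
            long fileBytes  = 300L * 1024 * 1024;   // assumed 300 MB file
            long blockBytes = 128L * 1024 * 1024;   // 128 MB HDFS block
            long chunkBytes = 64L * 1024;           // ~64 KB pipeline chunk (approximate)
            double diskMBps = 100.0;                // assumed local disk write speed

            long fullBlocks = fileBytes / blockBytes;   // 2 full blocks
            long lastBlock  = fileBytes % blockBytes;   // 44 MB "short" final block
            System.out.println(fullBlocks + " full blocks + final block of " + lastBlock + " bytes");

            // Pipeline model from point 1: ~1x block write + 2x one-chunk delta, not 3x.
            double blockSec = (blockBytes / (1024.0 * 1024.0)) / diskMBps;
            double chunkSec = (chunkBytes / (1024.0 * 1024.0)) / diskMBps;
            System.out.printf("Pipelined 3x write of one block: ~%.2f s (vs ~%.2f s if done serially)%n",
                    blockSec + 2 * chunkSec, 3 * blockSec);
        }
    }

Under these assumptions the pipelined write of one block takes about 1.28 s, barely more than a single-replica write, versus roughly 3.84 s if the three replicas were written one after another.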

New Contributor

Thank you all for the response.