HDFS File Placement when File Size Exceeds Block Size

Rising Star

When writing files larger than one block, how are blocks distributed across the datanodes? Documentation seems to indicate large files are split across datanodes (whenever possible), but I'm not sure this is always the case.


7 REPLIES

Master Guru

@jeden When a file exceeds the block size, the NN will try to put the next block on another DataNode within the same rack. See the HDFS replica placement documentation for details.

Master Guru

@jeden The NN may also put the next block on the same DataNode.

Master Guru

Are you sure? I was pretty sure that even big files have their first copy written locally.

Master Guru

@Benjamin Leonhardi That is true if the client is located on a DataNode. On an edge node (not associated with a DataNode), the NN will try to colocate the blocks, but I don't believe it can guarantee the same node. I believe it will be, at the very minimum, the same rack.

Master Guru

Normally they are not. Hadoop tries to write the first copy of every block locally if possible. This has a lot of nice characteristics when the same nodes that write files later read them (for example, HBase RegionServers).

So the general rule is:

Suppose one client writes a big file of 10 blocks:

- HDFS tries to write the first replica of each block to the local DataNode if the uploading client is colocated with a DataNode (in rack 1); otherwise any node is used.

- While the block is being written, that DataNode initiates a copy to a node in a different rack (rack 2).

- While that copy is being written, the second DataNode initiates a copy to another node in the same rack as itself (rack 2).

So you essentially have a copy chain; however, not all three writes need to finish successfully. If at least the first copy is written successfully, the block is considered written, and the client starts writing the second block, then the third, and so on.

But by default the first copy of each block will end up on the DataNode colocated with the client. As said, this has a lot of good effects in practice.

However, all of that goes out the window if that DataNode fills up.
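To make this concrete, here is a minimal sketch in Java (assuming the Hadoop client libraries and the cluster's core-site.xml/hdfs-site.xml are on the classpath; the path, block size, and file size are made-up values for illustration). It writes a file that spans several blocks, then asks the NameNode where each block's replicas landed, so you can check the first-replica-local behavior on your own cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacementDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/placement-demo.bin"); // hypothetical path
        long blockSize = 16L * 1024 * 1024; // 16 MB blocks so a modest file spans several

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize)) {
            byte[] chunk = new byte[1024 * 1024];
            for (int i = 0; i < 40; i++) { // ~40 MB -> three 16 MB blocks
                out.write(chunk);
            }
        }

        // One BlockLocation per block; getHosts() lists the DataNodes
        // holding that block's replicas. If this client runs on a DataNode,
        // every block should list the local host among its replicas.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
        }
    }
}
```

You can get the same view from the command line with hdfs fsck /tmp/placement-demo.bin -files -blocks -locations.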

Contributor (Accepted Solution)

The block placement in HDFS depends on a few things:

If the Client application is on a DataNode machine (e.g. a Pig script running on a node in the cluster), then HDFS will attempt to write all the first-replica blocks to that DataNode - because it is the "closest". Some blocks may get written to other DataNodes, for example if the first DataNode is full. Second-replica and third-replica (etc.) blocks get written randomly to multiple DataNodes according to the rack-aware block-placement policy.

If the Client is NOT on a DataNode machine, then all the first-replica blocks get written randomly to a DataNode in the same rack. Second-replica etc. blocks get written to random DataNodes as above.

If the Client is WebHDFS, then all the first-replica blocks get written to one DataNode (this is a limitation of the way WebHDFS works: the NameNode will only give the WebHDFS client one DataNode to write to). This can be a problem when writing files larger than a single disk. Second-replica etc. blocks get written to random DataNodes as above.
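For anyone who wants to see that limitation directly, here is a small sketch of the WebHDFS handshake (host name, port, and path are placeholders; 9870 is the default NameNode HTTP port on Hadoop 3, older clusters use 50070). The initial CREATE call is answered with a 307 redirect whose Location header names the single DataNode the client must stream all of the data to:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRedirectDemo {
    public static void main(String[] args) throws Exception {
        URL create = new URL(
            "http://namenode.example.com:9870/webhdfs/v1/tmp/big.bin"
            + "?op=CREATE&user.name=hdfs&overwrite=true");

        HttpURLConnection conn = (HttpURLConnection) create.openConnection();
        conn.setRequestMethod("PUT");
        conn.setInstanceFollowRedirects(false); // inspect the redirect instead of following it

        int status = conn.getResponseCode();          // expect 307 Temporary Redirect
        String dataNode = conn.getHeaderField("Location");
        System.out.println("HTTP " + status + " -> write the data to: " + dataNode);
        conn.disconnect();

        // The client then PUTs the file body to that one DataNode URL,
        // which is why all first replicas of a WebHDFS upload land on a
        // single node.
    }
}
```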

Rising Star

I got it. So the documented behavior is accurate if the client is not running directly on a DataNode, but it does not account for WebHDFS or a client running directly on a DataNode. Thanks!