Support Questions

jeden · ‎05-03-2016

When writing files larger than one block, how are blocks distributed across the datanodes? Documentation seems to indicate large files are split across datanodes (whenever possible), but I'm not sure this is always the case.
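For a sense of scale, here is a minimal sketch of how a file is cut into blocks before any placement decision is made. It assumes a 128 MiB block size (the usual default for `dfs.blocksize`; your cluster may be configured differently), and the function name `split_into_blocks` is made up for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MiB)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy.

    Every block is full-sized except possibly the last one.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last, partially filled block
    return sizes
```

Each entry in the returned list is one block that the NameNode has to place somewhere, which is what the answers in this thread are about.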

Justin_Watkins · ‎05-03-2016

The block placement in HDFS depends on a few things:

If the Client application is on a DataNode machine (e.g. a Pig script running on a node in the cluster), then HDFS will attempt to write all the first-replica blocks to that DataNode - because it is the "closest". Some blocks may get written to other DataNodes, for example if the first DataNode is full. Second-replica and third-replica (etc.) blocks get written randomly to multiple DataNodes according to the rack-aware block-placement policy.

If the Client is NOT on a DataNode machine, then the first replica of each block gets written to a randomly chosen DataNode in the same rack as the client. Second-replica etc. blocks get written to random DataNodes as above.

If the Client is WebHDFS, then all the first-replica blocks get written to one DataNode (this is a limitation of the way WebHDFS works: the NameNode will only give the WebHDFS client one DataNode to write to). This can be a problem when writing files larger than a single disk. Second-replica etc. blocks get written to random DataNodes as above.
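The three cases above can be sketched as a toy simulation. This is not HDFS code: the function `first_replica_nodes`, the `client_kind` labels, and the `topology` dictionary (DataNode name to rack name) are all assumptions made up for illustration, and "full" nodes are modelled as a fixed set.

```python
import random


def first_replica_nodes(num_blocks, client_kind, topology,
                        client_host=None, client_rack=None,
                        full=frozenset(), seed=0):
    """Pick the DataNode holding the first replica of each block.

    client_kind is one of "on_datanode", "off_datanode", "webhdfs".
    topology maps DataNode name -> rack name (hypothetical cluster).
    """
    rng = random.Random(seed)
    usable = sorted(n for n in topology if n not in full)
    if not usable:
        raise RuntimeError("no DataNode has free space")

    def random_node(rack=None):
        # Prefer the given rack; fall back to any usable node.
        candidates = [n for n in usable if rack is None or topology[n] == rack]
        return rng.choice(candidates or usable)

    if client_kind == "webhdfs":
        # One DataNode is handed out once and receives every block.
        return [random_node()] * num_blocks

    placements = []
    for _ in range(num_blocks):
        if client_kind == "on_datanode" and client_host in usable:
            placements.append(client_host)  # the "closest" node
        else:
            # Off-cluster client, or local node full: random node,
            # in the client's rack when one is available.
            rack = topology.get(client_host, client_rack)
            placements.append(random_node(rack))
    return placements
```

For example, with `topology = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2"}`, a client on `dn1` gets every first replica on `dn1`, while an `"off_datanode"` client with `client_rack="rack1"` sees its blocks spread over `dn1` and `dn2`.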

sunile_manjee · ‎05-03-2016

@jeden when a file exceeds the block size, the NN will try to put the next split on another DataNode within the same rack. Please read about replica placement here.

sunile_manjee · ‎05-03-2016

@jeden NN may also put the next split on the same data node.

bleonhardi · ‎05-03-2016

Are you sure? I was pretty sure that even big files have their first copy written locally.

sunile_manjee · ‎05-04-2016

@Benjamin Leonhardi that is the case if the client is located on a specific DataNode. On an edge node (not associated with a DataNode), the NN will try to colocate the splits, but I don't believe it can guarantee the same node. I believe it will be, at the very minimum, the same rack.

bleonhardi · ‎05-03-2016

Normally they are not. Hadoop tries to write the first copy of every block locally if possible (this has a lot of nice characteristics if the same nodes normally read the files they write, for example HBase Region Servers).

So the general rule is:

One client writes a big file with 10 blocks:

- HDFS tries to write the first copy of each block to the local DataNode if the uploading client is colocated with a DataNode in rack1 (otherwise any node is used)

- While the block is being written, that DataNode initiates a copy to another node in a different rack (rack2)

- While that copy is being written, the second DataNode initiates a copy to another node in the same rack as itself (rack2)

So you essentially have a copy chain; however, not all three writes need to finish successfully. If at least the first copy is written successfully, the block is considered written, and the client will start writing the second block, then the third, and so on.
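The copy chain for a single block can be sketched like this. It is a simplified model of the rule described above for three replicas, not the actual NameNode logic; the function name `replica_chain` and the `topology` dictionary (DataNode name to rack name) are assumptions for illustration.

```python
import random


def replica_chain(writer, topology, seed=0):
    """Return [first, second, third] DataNodes for one block.

    first:  the writer's own node if it is a DataNode, else any node
    second: a node in a different rack than the first
    third:  another node in the same rack as the second
    """
    rng = random.Random(seed)
    first = writer if writer in topology else rng.choice(sorted(topology))

    remote = sorted(n for n in topology if topology[n] != topology[first])
    if not remote:
        raise ValueError("this sketch needs at least two racks")
    second = rng.choice(remote)

    same_remote_rack = [n for n in remote
                        if topology[n] == topology[second] and n != second]
    if not same_remote_rack:
        raise ValueError("the remote rack needs at least two DataNodes")
    third = rng.choice(same_remote_rack)
    return [first, second, third]
```

Calling `replica_chain("dn1", topology)` once per block gives `dn1` as the first entry every time, which is the behavior described here for a client colocated with a DataNode.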

But the first copy of each block will by default end up on the same DataNode colocated with the client. As mentioned, this works out well in practice.

However, all of that goes out the window if the DataNode runs full.

jeden · ‎05-03-2016

I got it. So the documented behavior is accurate if the client is not directly running on a DataNode, but does not account for WebHDFS or a client directly connected to the node. Thanks!
