Created 04-19-2016 01:42 PM
Ex: a 300 MB file. After the split -> 128 MB + 128 MB + 44 MB. My question: will the third block wait until it receives the remaining 84 MB of data, or will it just write the 44 MB of data to the DataNode right away?
Created 04-19-2016 01:49 PM
An HDFS block corresponds to one file in the local file system on a DataNode. So regardless of the total data size, the data will be broken into 128 MB data files (the default block size) stored in the local file system. The last 44 MB chunk will also be written to a new data file. So you will find the following block files (data files) in your DataNode's local file system:
Here is an example:
/hadoop/data/dfs/data/
├── current
│ ├── BP-1079595417-192.168.2.45-1412613236271
│ │ ├── current
│ │ │ ├── VERSION
│ │ │ ├── finalized
│ │ │ │ └── subdir0
│ │ │ │ └── subdir1
│ │ │ │ ├── blk_1073741825 (128 MB)
│ │ │ │ ├── blk_1073741826 (128 MB)
│ │ │ │ ├── blk_1073741827 (44 MB)
Look for the 'dfs.datanode.data.dir' property in the HDFS configuration. It tells you where these files (which represent HDFS blocks) are located on a DataNode's local file system.
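As a rough illustration (plain Python, not actual HDFS code), the splitting described above can be sketched like this; the 128 MB figure is the default `dfs.blocksize`:

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the block files a file of the
    given size would occupy, using the 128 MB default block size."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Each block file holds at most one full block of data;
        # the last one simply holds whatever is left over.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

# A 300 MB file yields two full blocks and one 44 MB block file:
print(split_into_blocks(300))  # [128, 128, 44]
```

Note that the last entry is the size of the block *file* on disk, which is why the 44 MB chunk does not consume a full 128 MB of local storage.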
Created 04-19-2016 01:58 PM
HDFS allocates space one block at a time, and each block belongs to a file. If a file only partially fills its last block, that block (and its replicas) remains unfilled until data is appended to the file. If you append to the file, the last block of the file (and its replicas) is used to hold the appended data until the block is full.
For very large files (which is mostly why people use Hadoop), having a max of <blocksize>MB (plus replicas) of space unused is not too large of a concern. For example, if you have a 99.9GB file, you would allocate 799 full blocks (at 128MB/block) and have one block that was only 20% full. That equates to about 0.1% unused space for that file.
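To make that arithmetic concrete, here is a small Python sketch (illustrative only) reproducing the 99.9 GB example:

```python
BLOCK_MB = 128  # HDFS default block size

file_mb = 99.9 * 1024                              # 99.9 GB expressed in MB
full_blocks = int(file_mb // BLOCK_MB)             # full 128 MB blocks: 799
last_block_mb = file_mb - full_blocks * BLOCK_MB   # data in the final block: ~25.6 MB
fill_pct = last_block_mb / BLOCK_MB * 100          # final block is ~20% full
unused_pct = (BLOCK_MB - last_block_mb) / file_mb * 100  # ~0.1% of the file's footprint

print(full_blocks, round(fill_pct), round(unused_pct, 1))
```

So even in the worst case the unfilled tail of the last block is a tiny fraction of a large file's total footprint.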
Created 04-19-2016 02:31 PM
emaxwell, as per my example, it will write the 44 MB of data, and it will append to that block whenever the client comes back with another request, right?
And if the client never comes back, the unused 84 MB of that block will be wasted / left unused. Is my understanding correct?
Created 04-19-2016 06:26 PM
No, it will not append the data. Instead it will create a new block, i.e., a new data file in the DataNode's local file system.
Created 04-19-2016 01:59 PM
By design it won't wait for the next 84 MB of data; it will write the 44 MB block directly. These blocks are logical entities; internally, the DataNode uses the underlying ext3/ext4 file system blocks to store them.
Created 04-19-2016 06:23 PM
As per my understanding, whatever data comes from the client is split and written to a DataNode wherever free space is found. It won't append to any block that was left partially full last time. Please correct me if my understanding is wrong.
Created 04-19-2016 06:39 PM
This is kind of old, but it will give you a clear picture: http://www.formhadoop.es/img/HDFS-comic.pdf
Enjoy.