Created 02-24-2016 02:18 PM
Hello everyone, I have a few doubts related to Hadoop checksum calculation:
In the O'Reilly Hadoop book I came across the following line:
"Datanodes are responsible for verifying the data they receive before storing the data and its checksum "
1. Does this mean that the checksum is calculated before the data reaches the datanode for storage?
"A client writing data sends it to a pipeline of datanodes and the last datanode in the pipeline verifies the checksum"
2. Why should only the last node verify the checksum? Bit rot can occur on the initial datanodes as well, yet only the last node verifies it.
"When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes."
3. Is the checksum of the data stored at the datanode along with the data during the WRITE process?
A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes.
4. Suppose I have a file of size 10 MB. As per the above statement there will be 10,485,760 / 512 = 20,480 checksums created. If the block size is 1 MB then, as I understand it, the checksums are stored along with the block, so each block would store 2,048 checksums with it?
Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified.
5. May I know the path of this log file and what exactly it will contain? I am using the Cloudera VM.
When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.
6. For the above log file on the datanode, will writes happen only when the client reports a successful verification? What if the client observes failures during checksum verification?
Created 02-24-2016 06:31 PM
Hello @sameer khan. Addressing the questions point-by-point:
1. Does this mean that the checksum is calculated before the data reaches the datanode for storage?
Yes, an end-to-end checksum calculation is performed as part of the HDFS write pipeline while the block is being written to DataNodes.
2. Why should only the last node verify the checksum? Bit rot can occur on the initial datanodes as well, yet only the last node verifies it.
The intent of the checksum calculation in the write pipeline is to verify the data in transit over the network, not check bit rot on disk. Therefore, verification at the final node in the write pipeline is sufficient. Checking for bit rot in existing replicas on disk is performed separately at each DataNode by a background thread.
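As a side note, the background check mentioned above is the DataNode block scanner, and its interval is configurable. A minimal hdfs-site.xml sketch (the property name is taken from hdfs-default.xml; the 504-hour value is the commonly cited three-week default, so verify it against your release):

<property>
  <!-- How often, in hours, the DataNode background scanner re-verifies
       the checksums of the block replicas it stores -->
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>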
3. Is the checksum of the data stored at the datanode along with the data during the WRITE process?
Yes, the checksum is persisted at the DataNode. For each block replica hosted by a DataNode, there is a corresponding metadata file that contains metadata about the replica, including its checksum information. The metadata file will have the same base name as the block file, and it will have an extension of ".meta".
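To make that concrete, a DataNode data directory typically contains pairs of files along these lines (the block ID, generation stamp and subdirectory names below are made-up examples, and the exact layout varies by HDFS version):

.../current/BP-<blockpool-id>/current/finalized/subdir0/subdir0/blk_1073741825
.../current/BP-<blockpool-id>/current/finalized/subdir0/subdir0/blk_1073741825_1001.meta

The first file holds the raw block data; the second holds the checksum metadata for that replica.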
4. Suppose I have a file of size 10 MB. As per the above statement there will be 10,485,760 / 512 = 20,480 checksums created. If the block size is 1 MB then, as I understand it, the checksums are stored along with the block, so each block would store 2,048 checksums with it?
The DataNode stores a single ".meta" file corresponding to each block replica. Within that metadata file, there is an internal data format for storage of multiple checksums of different byte ranges within that block replica. All checksums for all byte ranges must be valid in order for HDFS to consider the replica to be valid.
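Working through the numbers in question 4 (assuming the default 512-byte chunk size and a 4-byte CRC per chunk): a 10 MB file is 10,485,760 bytes, so it is covered by 10,485,760 / 512 = 20,480 checksums. With a hypothetical 1 MB block size, each block replica's ".meta" file would hold 1,048,576 / 512 = 2,048 of those checksums, which amounts to only 2,048 * 4 = 8,192 bytes of checksum data per block.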
5. May I know the path of this log file and what exactly it will contain? I am using the Cloudera VM.
The files are prefixed with "dncp_block_verification.log" and will be stored under one of the DataNode data directories as configured by dfs.datanode.data.dir in hdfs-site.xml. The content of these files is multiple lines, each reporting date, time and block ID for a replica that was verified.
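A quick way to find them on a running DataNode is a sketch like the following (the /dfs/dn path is only an example of what the configured value might look like, not a fixed location):

hdfs getconf -confKey dfs.datanode.data.dir        # prints the configured data directories, e.g. file:///dfs/dn
find /dfs/dn -name 'dncp_block_verification.log*'  # the log files typically carry .curr / .prev suffixes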
6. For the above log file on the datanode, will writes happen only when the client reports a successful verification? What if the client observes failures during checksum verification?
This log records checksum verifications performed in the background by the DataNode, not client reads. If a client detects a checksum failure at read time, then the client reports the failure to the NameNode, which then recovers by invalidating the corrupt replica and scheduling re-replication from another known good replica. There would be some logging in the NameNode log related to this activity.
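To illustrate what a client-side checksum failure looks like from application code, here is a minimal sketch using the public FileSystem API (the class name and error handling are illustrative; in practice the HDFS client transparently retries another replica and reports the corrupt one, so an application rarely sees this exception unless every replica is bad):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWithVerify {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    byte[] buf = new byte[8192];
    // Checksums are verified transparently as the stream is read.
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      while (in.read(buf) != -1) {
        // process the bytes here
      }
    } catch (ChecksumException e) {
      // Surfaces only when no good replica could be served; corrupt
      // replicas are reported to the NameNode as described above.
      System.err.println("Corrupt data at offset " + e.getPos() + ": " + e.getMessage());
    }
  }
}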
Created 02-25-2016 06:26 AM
Thank you so much Chris, I really appreciate your response. I have a few more queries about your responses:
1. The DataNode stores a single ".meta" file corresponding to each block replica. Within that metadata file, there is an internal data format for storage of multiple checksums of different byte ranges within that block replica.
Why are there checksums of different byte ranges? Since we know that a 4-byte checksum is calculated for every 512 bytes of data by default, my question is: shouldn't all the checksums in this file be of the same length?
2. Also, dfs.bytes-per-checksum defaults to 512 bytes. Can't we configure this value to 1 GB or more, so that there are fewer checksums and some space is freed?
Created 04-16-2018 10:44 AM
Which of these is true: "The DataNode stores a single .meta file corresponding to each block replica" or "For each block replica hosted by a DataNode, there is a corresponding metadata file"?
Created 02-25-2016 06:33 AM
Hello @Chris Nauroth,
One more question:
1. hadoop fs -checksum <filename> will give the checksum of the file.
When this command is issued, does the namenode read the data from all the blocks (associated with the input file) on the respective datanodes, calculate the checksum, and print it at the terminal?
I ask this because I came to know that when we copy a file from one cluster to another using the distcp command, we can check whether both files have the same content by comparing the checksums obtained from the above command.
Created 02-25-2016 08:53 PM
@sameer khan, "hadoop fs -checksum" works a little differently than what you described. The client contacts the NameNode to get the locations of each block that make up the file. Then, it calls a DataNode hosting a replica of each of those blocks and asks it to return the checksum information that it has persisted in the block metadata. After receiving this checksum information for every block in the file, the individual block checksums are combined to form the final overall file checksum. The important distinction I want to make is that "hadoop fs -checksum" does not involve reading the entire byte-by-byte contents of the file. It only involves interrogating the block metadata files.
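The same block-checksum aggregation is available programmatically through FileSystem.getFileChecksum, which is handy for the distcp comparison you asked about. A minimal sketch (paths are illustrative; note that the two files must have been written with the same block size and bytes-per-checksum for their composite checksums to be comparable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareChecksums {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path(args[0]);  // e.g. a file on the source cluster
    Path dst = new Path(args[1]);  // e.g. its copy on the destination cluster
    // Asks the DataNodes for their stored block checksums and combines them;
    // no byte-by-byte read of the file contents is required.
    FileChecksum a = src.getFileSystem(conf).getFileChecksum(src);
    FileChecksum b = dst.getFileSystem(conf).getFileChecksum(dst);
    // getFileChecksum may return null on file systems that do not support it.
    System.out.println(a != null && a.equals(b) ? "checksums match" : "checksums differ");
  }
}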
Created 02-26-2016 04:19 AM
Hey @Chris, thanks again. You are a real saviour.
Kindly answer these as well:
1. The DataNode stores a single ".meta" file corresponding to each block replica. Within that metadata file, there is an internal data format for storage of multiple checksums of different byte ranges within that block replica.
Why are there checksums of different byte ranges? Since we know that a 4-byte checksum is calculated for every 512 bytes of data by default, my question is: shouldn't all the checksums in this file be of the same length?
2. Also, dfs.bytes-per-checksum defaults to 512 bytes. Can't we configure this value to 1 GB or more, so that there are fewer checksums and some space is freed?
Created 02-26-2016 08:01 PM
@Tabrez Basha Syed, yes, the metadata format is such that the length of each checksum is the same and well-known in advance of reading it.
dfs.bytes-per-checksum represents a trade-off. When an HDFS client performs reads, it must read at a checksum boundary to recalculate and verify the checksum successfully. Assume a 128 MB block size, and also assume dfs.bytes-per-checksum set to 128 MB, so that there is only a single checksum boundary. Now assume that a reading client wants to seek to the middle of a block and only read a range of data starting from that point. If there was only a single checksum, then it would still have to start reading from the beginning of the block, just to recalculate the checksum, even though it doesn't want to read that data. This would be inefficient. With dfs.bytes-per-checksum set to 512, a seek and read can begin checksum recalculation on any 512-byte boundary.
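As a quick worked example of that cost: with the default 512-byte setting, a client that wants to read from byte offset 1,000,000 of a block backs up to the nearest checksum boundary, 1,000,000 - (1,000,000 mod 512) = 999,936, so it reads only 64 extra bytes before the data it actually wants. If the whole 128 MB block carried a single checksum, the same positioned read would have to start from offset 0 and pull in roughly 1 MB of unwanted data just to recompute the checksum, and a seek near the end of the block could cost close to the full 128 MB.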
In practice, assuming a 128 MB block size and dfs.bytes-per-checksum set to 512, the block metadata file will be ~1 MB. That means that < 1% of HDFS storage capacity is dedicated to checksum storage, so it's an appropriate trade-off. It's rare to need tuning of dfs.bytes-per-checksum.
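The ~1 MB figure follows directly from the defaults, assuming a 4-byte CRC per chunk: 128 MB = 134,217,728 bytes, divided by 512 bytes per checksum gives 262,144 chunks, and 262,144 * 4 = 1,048,576 bytes of checksum data, i.e. about 1 MB per 128 MB block, or roughly 0.8% overhead.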