Hadoop Checksum Calculation Doubts

Hello everyone, I have some doubts about Hadoop checksum calculation:


In the O'Reilly book I came across the following line:


"Datanodes are responsible for verifying the data they receive before storing the data and its checksum "


1. Does this mean that the checksum is calculated before the data reaches the datanode for storage?


"A client writing data sends it to a pipeline of datanodes and the last datanode in the pipeline verifies the checksum"


2. Why should only the last datanode verify the checksum? Bit rot can also affect the earlier datanodes in the pipeline, so why is the last node the only one that verifies it?



"When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes."


3. Is the checksum stored at the datanode along with the data during the write process?


"A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes."
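
To make the per-chunk idea concrete, here is a minimal sketch of how I picture it. It is an illustration only, not HDFS code: the function names are hypothetical, and it uses Python's `zlib.crc32` (plain CRC-32) as a stand-in, whereas HDFS's checksum type is configurable and may be CRC32C.

```python
import zlib

# Chunk size taken from the quoted default of dfs.bytes-per-checksum.
BYTES_PER_CHECKSUM = 512


def chunk_checksums(data, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Return one CRC-32 value per chunk of at most bytes_per_checksum bytes.

    Hypothetical helper: zlib.crc32 is a stand-in for whatever
    checksum type the cluster is configured to use.
    """
    return [
        zlib.crc32(data[i:i + bytes_per_checksum])
        for i in range(0, len(data), bytes_per_checksum)
    ]


def first_corrupt_chunk(data, stored, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Recompute the checksums and return the index of the first chunk
    that does not match the stored value, or None if every chunk matches."""
    actual = chunk_checksums(data, bytes_per_checksum)
    for index, (got, expected) in enumerate(zip(actual, stored)):
        if got != expected:
            return index
    return None


# 2048 bytes of sample data gives 4 chunks of 512 bytes each.
data = bytes(range(256)) * 8
stored = chunk_checksums(data)

# Flip one bit at byte offset 700, which falls in chunk 1 (bytes 512..1023).
damaged = bytearray(data)
damaged[700] ^= 0x01

print(len(stored))                                  # 4
print(first_corrupt_chunk(data, stored))            # None
print(first_corrupt_chunk(bytes(damaged), stored))  # 1
```

The point of the small chunk size, as I understand it, is that a mismatch pinpoints which 512-byte chunk is bad rather than only saying that the whole block is bad.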



4. Suppose I have a file of size 10 MB. As per the above statement, 20,480 checksums will be created (10 MB / 512 bytes). If the block size is 1 MB, then as I understand it the checksums have to be stored along with each block. So in this case, will each block store 2,048 checksums with it?
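
Working the numbers through for question 4, treating 512 as bytes (not kilobytes): the sketch below assumes 1 MB = 1024 × 1024 bytes and a 4-byte checksum per chunk (true for a 32-bit CRC). The helper name and constants are hypothetical, chosen just for this calculation.

```python
MB = 1024 * 1024          # assumption: binary megabytes
BYTES_PER_CHECKSUM = 512  # quoted default
CHECKSUM_SIZE = 4         # assumption: a 32-bit CRC occupies 4 bytes


def checksum_count(num_bytes, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Number of checksums needed to cover num_bytes (a partial last chunk counts)."""
    return (num_bytes + bytes_per_checksum - 1) // bytes_per_checksum


file_size = 10 * MB
block_size = 1 * MB

blocks = file_size // block_size
per_block = checksum_count(block_size)
total = checksum_count(file_size)

print(blocks)                     # 10
print(per_block)                  # 2048
print(total)                      # 20480
print(per_block * CHECKSUM_SIZE)  # 8192 bytes of checksum data per 1 MB block
```

So under these assumptions each 1 MB block carries 2,048 checksums, which is under 1% extra storage.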


"Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified."


5. What is the path of this log file, and what exactly does it contain? I am using the Cloudera VM.


"When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks."


6. Is this log file on the datanode written only when the client reports a successful verification? What happens if the client detects a checksum failure?