Created 06-23-2016 07:40 AM
I would like to store the HDFS checksums of certain files on HDFS in an other location to detect tampering of the data in those files. Is this a good idea? Will future versions of HDFS deliver the same checksum values? Or should I calculate my own checksums based on the bytes in the raw files?
Created 06-23-2016 07:55 PM
> Is this a good idea?
Yes, certainly, but be aware that most file system checksums are not tamper resistant. For example, CRC-32 and MD5 offer no tamper resistance, because it is easy to create collisions. If you want to do this, you should be computing something like SHA-256 so that your purpose is actually achieved.
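For illustration, here is a minimal sketch of computing SHA-256 over a file's raw bytes through the Hadoop FileSystem client API (the class name and buffer size are placeholders; only the FileSystem, Path and MessageDigest calls are standard API):

```java
import java.net.URI;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Sha256OfFile {
    // Computes SHA-256 over the raw bytes of a file reachable through any
    // Hadoop FileSystem implementation (hdfs://, s3a://, file://, ...).
    public static byte[] sha256(URI file, Configuration conf) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        FileSystem fs = FileSystem.get(file, conf);
        try (FSDataInputStream in = fs.open(new Path(file))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return digest.digest();
    }
}
```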
> Will future versions of HDFS deliver the same checksum values?
Generally HDFS tries not to break things, and we try to preserve backward-compatible behaviour as much as possible. Unfortunately, in this specific case I was not able to find any interface guarantee in the code that implies we will always use the same checksums. In fact, the FileChecksum object we return carries the name of the algorithm used and the length of the checksum, along with the checksum bytes. So if you decide to use this feature (once again, not a cryptographically sound idea, since HDFS checksums are not strong enough to detect tampering), you should store the algorithm name and the hash length as well as the hash bytes. If Hadoop changes the algorithm in a future release (quite possible), you will at least be able to detect it.
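As a sketch of what such a record could look like (getFileChecksum() and the FileChecksum accessors are standard Hadoop API; how you persist the three fields is up to you):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordHdfsChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);

        // Some file systems return null here; HDFS returns a checksum object.
        FileChecksum checksum = fs.getFileChecksum(file);
        if (checksum == null) {
            System.out.println("No checksum available for " + file);
            return;
        }

        // Store all three pieces so a future algorithm change is detectable.
        StringBuilder hex = new StringBuilder();
        for (byte b : checksum.getBytes()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("algorithm = " + checksum.getAlgorithmName());
        System.out.println("length    = " + checksum.getLength());
        System.out.println("bytes     = " + hex);
    }
}
```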
> Or should I calculate my own checksums based on the bytes in the raw files?

That is the smartest solution, for three reasons. One, you are in full control and do not have to worry about changes in HDFS. Two, you can use a cryptographically sound algorithm. Three, FileSystem is an interface in Hadoop, so the checksum returned by HDFS might not be the same if you decide to use another file system such as S3. With your own checksums you can move data to another file system or back it up to another medium and still be able to verify your data's integrity.
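That portability is exactly what the SHA-256 helper sketched earlier gives you: the same code can be pointed at any supported file system URI. The hosts, bucket and paths below are made up for illustration:

```java
import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;

public class VerifyAcrossFileSystems {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical paths: the same digest routine works against any
        // FileSystem implementation, so the stored SHA-256 stays comparable
        // after the data is moved or backed up elsewhere.
        byte[] onHdfs = Sha256OfFile.sha256(
                URI.create("hdfs://namenode:8020/data/part-00000"), conf);
        byte[] onS3 = Sha256OfFile.sha256(
                URI.create("s3a://backup-bucket/data/part-00000"), conf);
        System.out.println("digests match: " + Arrays.equals(onHdfs, onS3));
    }
}
```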
Created 06-23-2016 08:44 AM
Yes, you can use the HDFS checksum to verify file integrity on HDFS.
See the JIRA below for more information.
https://issues.apache.org/jira/browse/HADOOP-9209
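If I recall correctly, that JIRA added a checksum option to the FsShell, so you can also dump a file's checksum from the command line; it prints the algorithm name and the checksum bytes (the path below is a placeholder):

```
hdfs dfs -checksum /path/to/file
```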