Created 06-23-2016 07:40 AM
I would like to store the HDFS checksums of certain files on HDFS in an other location to detect tampering of the data in those files. Is this a good idea? Will future versions of HDFS deliver the same checksum values? Or should I calculate my own checksums based on the bytes in the raw files?
Created 06-23-2016 07:55 PM
> Is this a good idea?
Yes, certainly, but be aware that most file system checksums are not tamper resistant. For example, CRC-32 and MD5 offer no tamper resistance, because it is easy to create collisions. If you want to do this, you should be computing something like SHA-256 so that your purpose is actually achieved.
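For illustration, here is a minimal sketch of computing SHA-256 over a file's raw bytes through the Hadoop FileSystem client API (the class name and buffer size are placeholders; only the FileSystem, Path and MessageDigest calls are standard API):

```java
import java.net.URI;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Sha256OfFile {
    // Computes SHA-256 over the raw bytes of a file reachable through any
    // Hadoop FileSystem implementation (hdfs://, s3a://, file://, ...).
    public static byte[] sha256(URI file, Configuration conf) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        FileSystem fs = FileSystem.get(file, conf);
        try (FSDataInputStream in = fs.open(new Path(file))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return digest.digest();
    }
}
```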
> Will future versions of HDFS deliver the same checksum values?
Generally HDFS tries not to break things, and we try to preserve backward-compatible behaviour as much as possible. Unfortunately, in this specific case I was not able to find any interface guarantee in the code that implies we will always use the same checksums. In fact, the FileChecksum object we return carries the name of the algorithm used and the length of the checksum, along with the checksum bytes. So if you decide to use this feature (once again, not a cryptographically sound idea, since HDFS checksums are not strong enough to detect tampering), you should store the algorithm name and the hash length as well as the hash bytes. If Hadoop changes the algorithm in a future release (quite possible), you will at least be able to detect it.
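As a sketch of what such a record could look like (getFileChecksum() and the FileChecksum accessors are standard Hadoop API; how you persist the three fields is up to you):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordHdfsChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);

        // Some file systems return null here; HDFS returns a checksum object.
        FileChecksum checksum = fs.getFileChecksum(file);
        if (checksum == null) {
            System.out.println("No checksum available for " + file);
            return;
        }

        // Store all three pieces so a future algorithm change is detectable.
        StringBuilder hex = new StringBuilder();
        for (byte b : checksum.getBytes()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("algorithm = " + checksum.getAlgorithmName());
        System.out.println("length    = " + checksum.getLength());
        System.out.println("bytes     = " + hex);
    }
}
```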
> Or should I calculate my own checksums based on the bytes in the raw files?

That is the smartest solution, for three reasons. One, you are in full control and do not have to worry about changes in HDFS. Two, you can use a cryptographically sound algorithm. Three, FileSystem is an interface in Hadoop, so the checksum returned by HDFS might not be the same if you decide to use another file system such as S3. With your own checksums you can move data to another file system or back it up to another medium and still be able to verify your data's integrity.
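That portability is exactly what the SHA-256 helper sketched earlier gives you: the same code can be pointed at any supported file system URI. The hosts, bucket and paths below are made up for illustration:

```java
import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;

public class VerifyAcrossFileSystems {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical paths: the same digest routine works against any
        // FileSystem implementation, so the stored SHA-256 stays comparable
        // after the data is moved or backed up elsewhere.
        byte[] onHdfs = Sha256OfFile.sha256(
                URI.create("hdfs://namenode:8020/data/part-00000"), conf);
        byte[] onS3 = Sha256OfFile.sha256(
                URI.create("s3a://backup-bucket/data/part-00000"), conf);
        System.out.println("digests match: " + Arrays.equals(onHdfs, onS3));
    }
}
```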
Created 06-23-2016 08:44 AM
Yes, you can use the HDFS checksum to verify file integrity on HDFS.
See the JIRA below for more information.
https://issues.apache.org/jira/browse/HADOOP-9209
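If I recall correctly, that JIRA added a checksum option to the FsShell, so you can also dump a file's checksum from the command line; it prints the algorithm name and the checksum bytes (the path below is a placeholder):

```
hdfs dfs -checksum /path/to/file
```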