
In some Apache Hadoop use cases we need the checksum of a file stored in HDFS. This is especially useful when moving data into or out of HDFS, to verify that a file was transferred correctly.

Earlier there was no easy way to make this comparison, but starting with Apache Hadoop 3.1 we can compare the checksum of a file stored in HDFS with that of a file stored locally (see HDFS-13056).

The default checksum algorithm for HDFS chunks is CRC32C. A client can override it by setting dfs.checksum.type (either CRC32 or CRC32C). This is not a cryptographically strong checksum, but it is suitable for quick comparison.
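CRC32 and CRC32C use different polynomials, so they give different values for the same bytes. The following is a minimal Python sketch of that difference, not HDFS code. The helper names `crc32`, `crc32c` and `_make_table` are illustrative, and the CRC32C routine is a simple table-driven version written for clarity rather than speed.

```python
import zlib


def _make_table(poly):
    """Build a 256-entry lookup table for a reflected 32-bit CRC polynomial."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


# 0x82F63B78 is the reflected form of the Castagnoli polynomial used by CRC32C.
_CRC32C_TABLE = _make_table(0x82F63B78)


def crc32c(data: bytes) -> int:
    """CRC32C (Castagnoli) of data, computed one byte at a time."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32(data: bytes) -> int:
    """Standard CRC32 (the variant used by zlib/gzip) of data."""
    return zlib.crc32(data) & 0xFFFFFFFF


sample = b"123456789"  # conventional CRC test input
print(f"CRC32  = {crc32(sample):08x}")   # cbf43926
print(f"CRC32C = {crc32c(sample):08x}")  # e3069283
```

Because the two algorithms disagree even on this short input, a checksum is only comparable with another checksum of the same type.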

When we run the checksum command (hdfs dfs -checksum) on an HDFS file, it calculates the MD5 of the MD5s of the checksums of individual chunks (each chunk is typically 512 bytes long). The result depends on how the file was split into chunks and blocks, so it is not very useful for comparison with a local copy.
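To see why a digest built from per-chunk checksums is hard to compare, here is a simplified Python illustration. It is not the exact HDFS byte layout (HDFS also has a per-block MD5 level), and the function name `digest_of_chunk_crcs` is hypothetical. It only shows that the same bytes give a different digest when the chunk size changes.

```python
import hashlib
import zlib


def digest_of_chunk_crcs(data: bytes, chunk_size: int) -> str:
    """MD5 over the concatenated big-endian CRC32 values of fixed-size chunks."""
    md5 = hashlib.md5()
    for start in range(0, len(data), chunk_size):
        crc = zlib.crc32(data[start:start + chunk_size]) & 0xFFFFFFFF
        md5.update(crc.to_bytes(4, "big"))
    return md5.hexdigest()


data = bytes(range(256)) * 8  # 2048 bytes of sample data

with_512 = digest_of_chunk_crcs(data, 512)
with_1024 = digest_of_chunk_crcs(data, 1024)

# Same content, different chunking, different digest.
print(with_512)
print(with_1024)
print("digests equal:", with_512 == with_1024)
```

A plain CRC over the whole file has no such dependence on chunk size.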

Example

For example, the command below computes the checksum of the file hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar stored in HDFS:

hdfs dfs -checksum /tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar MD5-of-0MD5-of-512CRC32C  000002000000000000000000c16859d1d071c6b1ffc9c8557d4909f1

However, this checksum is not easily comparable to that of a local copy. Instead, we can calculate the CRC32C checksum of the whole file by adding -Ddfs.checksum.combine.mode=COMPOSITE_CRC to the same command:

bin/hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar COMPOSITE-CRC32C  3799db55

The property dfs.checksum.combine.mode=COMPOSITE_CRC tells HDFS to calculate a combined CRC from the individual chunk CRCs instead of calculating an MD5-of-MD5-of-CRCs.
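CRCs can be combined because the CRC of a concatenation can be derived from the CRCs of its parts and the length of the second part. The Python sketch below demonstrates that property for CRC32 using only `zlib.crc32`. It is not the HDFS implementation, and the helper `crc32_combine` is an illustrative name. This naive version feeds zero bytes through the CRC, so its cost grows with the chunk length; efficient implementations avoid that.

```python
import zlib

MASK = 0xFFFFFFFF


def crc32_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """Return CRC32(A + B) given CRC32(A), CRC32(B) and len(B)."""
    # Advance crc_a across len_b zero bytes with the usual pre/post
    # inversion cancelled out, then XOR in crc_b.
    shifted = (zlib.crc32(bytes(len_b), crc_a ^ MASK) & MASK) ^ MASK
    return shifted ^ crc_b


data = b"hello hdfs checksum " * 200
chunk_size = 512

combined = 0  # CRC32 of the empty byte string
for start in range(0, len(data), chunk_size):
    chunk = data[start:start + chunk_size]
    chunk_crc = zlib.crc32(chunk) & MASK
    combined = crc32_combine(combined, chunk_crc, len(chunk))

whole = zlib.crc32(data) & MASK
print(f"combined = {combined:08x}")
print(f"whole    = {whole:08x}")
print("match:", combined == whole)
```

The combined value equals the CRC of the whole byte string regardless of the chunk size chosen, which is what makes a composite CRC comparable with a checksum computed locally.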

It is important to note that we can calculate a checksum of type CRC32C or CRC32 for an HDFS file only according to how it was originally written. For example, we can't calculate CRC32 for the file in the above example, as its chunks were originally written with CRC32C checksums.

If we want to get the CRC32 of the above file, we need to specify dfs.checksum.type as CRC32 when writing the file:

hdfs dfs -Ddfs.checksum.type=CRC32 -put  hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar /tmp
hdfs dfs -checksum /tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar MD5-of-0MD5-of-512CRC32  0000020000000000000000009f26e871c80d4cbd78b8d42897e5b364
hdfs dfs  -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar COMPOSITE-CRC32 c1ddb422

This checksum can easily be compared to the checksum of the same file in the local file system, computed with the crc32 command:

crc32 hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
c1ddb422
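Where the crc32 command is not installed, the local CRC32 can also be computed with Python's standard zlib module. The sketch below is one possible approach. The helper names `file_crc32_hex` and `matches_hdfs_checksum` are hypothetical, and it assumes the hex checksum is the last whitespace-separated field of the `hdfs dfs -checksum` output line, as in the output shown above. It covers CRC32 only, not CRC32C.

```python
import os
import tempfile
import zlib


def file_crc32_hex(path: str, buffer_size: int = 1 << 20) -> str:
    """Stream a local file through CRC32 and return 8 lowercase hex digits."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(buffer_size):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def matches_hdfs_checksum(path: str, hdfs_output_line: str) -> bool:
    """Compare a local file's CRC32 with a COMPOSITE-CRC32 output line."""
    # Assumption: the checksum is the last field of the output line.
    hdfs_hex = hdfs_output_line.split()[-1].lower()
    return file_crc32_hex(path) == hdfs_hex


# Small self-contained demo using a temporary file instead of a real jar.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"123456789")
    demo_path = tmp.name

demo_hex = file_crc32_hex(demo_path)
demo_line = f"/tmp/demo.bin COMPOSITE-CRC32 {demo_hex}"
print(demo_hex, matches_hdfs_checksum(demo_path, demo_line))
os.remove(demo_path)
```

In practice, the second argument would be the line printed by the hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum command for a file written with dfs.checksum.type=CRC32.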