
Checksum in Hadoop

Hi,

How does the checksum function work in Hadoop?

Checksums show up in the file system commands, and distcp also has a checksum option.

Can someone explain what a checksum does in Hadoop?

Thanks,

NRG

Re: Checksum in Hadoop

Cloudera Employee

Hi NRG,

In HDFS, every file, and even every block of a file, has a checksum. It is calculated when the file or block is written, and it is used to verify data integrity (for example, to detect disk errors) when the data is read back.
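
To make the write-then-verify idea concrete, here is a minimal Python sketch. It is not HDFS code: the function names are made up for illustration, `zlib.crc32` stands in for whatever checksum algorithm the cluster is configured with, and the 512-byte chunk size is simply an assumption for the example.

```python
import zlib
from typing import List

CHUNK_SIZE = 512  # assumed chunk size for this sketch only


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Write path: compute one CRC32 per fixed-size chunk of the data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data: bytes, stored: List[int],
                        chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Read path: recompute the checksums and return the indexes that differ."""
    current = chunk_checksums(data, chunk_size)
    return [i for i, (now, then) in enumerate(zip(current, stored)) if now != then]


# "Write" 2048 bytes (4 chunks) and remember their checksums.
original = bytes(range(256)) * 8
stored = chunk_checksums(original)

# Simulate a disk error: flip one bit at offset 1100, which is in chunk 2.
damaged = bytearray(original)
damaged[1100] ^= 0x01

print(find_corrupt_chunks(original, stored))        # [] (data is intact)
print(find_corrupt_chunks(bytes(damaged), stored))  # [2]
```

Keeping one checksum per small chunk, rather than one per file, means a reader can tell which part of the data is bad instead of only learning that something is wrong somewhere.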

 

Distcp can use the checksum to determine whether two files are the same, so it can avoid copying the same large files over again (for example, when making backups). This is more reliable than comparing the files' "modified" timestamps.
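
The copy-or-skip decision can be sketched roughly like this in Python. This is only an illustration on local files, not how distcp is implemented: `needs_copy` and `file_checksum` are hypothetical helpers, and MD5 from `hashlib` stands in for the HDFS file checksum.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, block_size: int = 1 << 20) -> str:
    """Return a hex digest of the file contents, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_copy(source: str, target: str) -> bool:
    """Copy only if the target is missing or its content differs."""
    if not os.path.exists(target):
        return True
    if os.path.getsize(source) != os.path.getsize(target):
        return True  # cheap check first: different sizes cannot match
    return file_checksum(source) != file_checksum(target)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.bin")
    dst = os.path.join(tmp, "dst.bin")
    for path in (src, dst):
        with open(path, "wb") as f:
            f.write(b"same content")

    # Different modification time, identical data: no copy needed.
    os.utime(dst, (0, 0))
    same_data_result = needs_copy(src, dst)

    # Same size, different bytes: the checksum catches the change.
    with open(dst, "wb") as f:
        f.write(b"SAME content")
    changed_data_result = needs_copy(src, dst)

print(same_data_result, changed_data_result)  # False True
```

The first case shows why a timestamp alone is misleading: the two files have different modification times but identical data, so copying again would be wasted work.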

 

 

Cheers,

zegab