
HDFS: too many bad blocks due to checksum errors - Understanding verifyMeta behaviour

New Contributor

We are trying to set up a Hadoop installation and are using CDH 5.15.1. We have recently noticed that a lot of blocks are flagged as "bad" due to checksum errors AFTER they have been finalized. Our installation is small - 8 datanodes, each with a single 50 TB SCSI disk (the disk is actually part of a SAN) on an ext4 filesystem, and a replication factor of 3.

 

I understand that a volume scanner runs on each datanode and checks the integrity of individual blocks by comparing the checksum stored in the meta file with the checksum computed from the actual block file. I am also aware of the "hdfs debug verifyMeta" command, which we can run on the datanode to check a block against its stored checksums.
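For reference, this is the kind of check I mean. The data directory and the block/meta file names below are just placeholders for illustration; the real ones come from find (or from fsck output):

  # locate the block file and its meta file on the datanode (example data dir)
  find /dfs/dn -name 'blk_XXXXXXXX*'

  # from the directory containing the replica, verify it against its meta file
  hdfs debug verifyMeta -meta blk_XXXXXXXX_YYY.meta -block blk_XXXXXXXX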

 

Once we had a few files flagged as corrupt due to missing blocks, I picked one block and checked all the nodes where it lived. On each node, the actual block file had the same size and creation time but a different MD5 hash (obtained by running md5sum blk_XXXXXXXX). The meta files all had the same MD5 checksum. Also, all three copies failed the verifyMeta test (hdfs debug verifyMeta -block blk_XXXXXX -meta blk_XXXXX_YYY.meta reported a checksum error).
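Concretely, what I compared on each of the three datanodes holding a replica was roughly this (run from the directory find pointed me at; names are shortened placeholders):

  ls -l blk_XXXXXXXX blk_XXXXXXXX_YYY.meta   # same size and creation time on every node
  md5sum blk_XXXXXXXX                        # different on every node
  md5sum blk_XXXXXXXX_YYY.meta               # identical on every node

and each replica then failed the verifyMeta check shown above.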

 

Intrigued, I scanned one node for more failing block files and found a bunch. I concentrated on one block (blk_1073744395), which belonged to file A. I tracked the block to three nodes; all three had different MD5 checksums for the block file and all three failed the verifyMeta test. However, fsck -blockId returned a HEALTHY status for all three replicas. I then decided to fetch the file from HDFS and did a -copyToLocal. The logs indicated that node1 threw a checksum error but node2 fulfilled the request correctly. The replica was then removed from node1 and re-replicated on node4, where again I found that it had a different MD5 and failed the verifyMeta test.
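The commands behind that paragraph were essentially these (the HDFS path for file A is made up for illustration):

  # ask the namenode about the block's replicas and their health
  hdfs fsck -blockId blk_1073744395

  # read the file back through the normal client path
  hdfs dfs -copyToLocal /data/bro/fileA.json /tmp/fileA.json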

 

My Questions are:

- Is it possible for the verifyMeta check to fail but the actual checksum verification (done when serving the block to a client) to pass on the datanode, as we saw?

- Should all replicas of a block have the same hash (say, MD5)?

- What may be causing finalized blocks to start failing checksum verification if the disk is healthy?

 

I would be grateful if someone could shed some light on the behaviour we are seeing on our datanodes.

1 ACCEPTED SOLUTION

Expert Contributor

First of all, CDH didn't support SAN until very recently, and even now the support is limited.

https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.h...

 

Warning: Running CDH on storage platforms other than direct-attached physical disks can provide suboptimal performance. Cloudera Enterprise and the majority of the Hadoop platform are optimized to provide high performance by distributing work across a cluster that can utilize data locality and fast local I/O. Refer to the Cloudera Enterprise Storage Device Acceptance Criteria Guide for more information about using non-local storage.
 
That said, I am interested in knowing more about your setup. What application writes those corrupt files? HDFS in CDH 5.15 is quite stable and most of the known data corruption bugs have been fixed. Probably the only bug whose fix is not in CDH 5.15 is HDFS-10240, which Flume on a busy cluster could trigger, but that symptom doesn't quite match your description anyway.
 

> - Is it possible for the verifyMeta check to fail but the actual checksum verification (done when serving the block to a client) to pass on the datanode, as we saw?

I won't say that's impossible, but we've not seen such a case. The verifyMeta implementation is actually quite simple.

 

> - Should all replicas of a block have the same hash (say, MD5)?

If all replicas have the same size, they are supposed to have the same checksum. (We support append, but not truncate.) If your SAN device is busy, there is a chance the HDFS client will give up writing to a DataNode, replace it in the write pipeline with a different DataNode, and continue from there. In that case, replicas may have different file lengths, because some of them are stale.
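If you want to rule out the stale-replica case, comparing the on-disk length of each replica is enough; for example, on each of the DataNodes holding the block (the data directory below is only an example, yours may differ):

  # print the size of this datanode's copy of the block file
  find /dfs/dn -name 'blk_1073744395' -exec ls -l {} \;

If all the lengths match, stale replicas left behind by a pipeline recovery are not the explanation.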

 

> - What may be causing finalized blocks to start failing checksum verification if the disk is healthy?

An underperforming disk or a busy DataNode could abort the write to that block. I can't give you a definitive answer because I don't have much experience with HDFS on SAN.


4 REPLIES 4


New Contributor
Thanks for your reply.

We use Logstash to ship Bro event logs to Hadoop. Earlier we used the WebHDFS interface, but after encountering a lot of bad blocks (we thought APPEND might be the cause) we switched to a simple hdfs dfs -put. Logstash writes hourly log files (JSON lines), we ship them to a gateway node and do a -put, so I don't think it's a client issue.
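For context, the hourly ingest from the gateway node is basically just this (the paths are simplified examples, not our real ones):

  # copy the hourly Bro/Logstash output into HDFS
  hdfs dfs -put conn.2018-10-01-13.json /data/bro/

  # spot-check the uploaded file and where its blocks landed
  hdfs fsck /data/bro/conn.2018-10-01-13.json -files -blocks -locations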

I am still a little surprised that a block that failed the verifyMeta test was deemed OK by HDFS and served.

I don't see any slow-write warnings in the logs, but I do see nodes in the write pipeline complaining about bad checksums (while writing) and giving up.
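Roughly how I am finding those messages, assuming the default CDH log location on the datanodes (adjust the path for your install):

  # look for checksum complaints from the write pipeline in the datanode log
  grep -i checksum /var/log/hadoop-hdfs/*DATANODE*.log*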

Expert Contributor
Thanks, I really appreciate you sharing that.

The WebHDFS append operation was prone to a corruption bug, HDFS-11160, but that was fixed in CDH 5.11.0.

> I don't see any slow-write warnings in the logs, but I do see nodes in the write pipeline complaining about bad checksums (while writing) and giving up.

That's an interesting observation. Checksum errors should be a very rare event, if they happen at all. Without further details, I would suspect the SAN has something to do with it. It's just such a rare setup in our customer install base that it's hard for me to tell what the effect would be.

New Contributor
Thank you for that insight. I will mark your original post as accepted and maybe update the post later if we have any new information to share.