
HDFS: too many bad blocks due to checksum errors - Understanding verifyMeta behaviour

New Contributor

We are trying to set up a Hadoop installation and are using CDH 5.15.1. We have recently noticed that a lot of blocks are flagged as "bad" due to checksum errors AFTER they have been finalized. Our installation is small - 8 datanodes, each with a single 50 TB SCSI disk (the disk is actually part of a SAN) on an ext4 filesystem, and a replication factor of 3.

 

I understand that a volume scanner runs on each datanode and checks the integrity of individual blocks by comparing the checksum stored in the meta file with the checksum computed from the actual block file. I am also aware of the "hdfs debug verifyMeta" command, which we can run on the datanode to check a block against its stored checksums.
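For reference, this is the kind of check I mean. The data directory and the block/meta file names below are just placeholders for illustration; the real ones come from find (or from fsck output):

  # locate the block file and its meta file on the datanode (example data dir)
  find /dfs/dn -name 'blk_XXXXXXXX*'

  # from the directory containing the replica, verify it against its meta file
  hdfs debug verifyMeta -meta blk_XXXXXXXX_YYY.meta -block blk_XXXXXXXX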

 

Once we had a few files flagged as corrupt due to missing blocks, I picked one block and checked all the nodes where it lived. On each node, the actual block file had the same size and creation time but a different MD5 hash (obtained by running md5sum blk_XXXXXXXX). The meta files all had the same MD5 checksum. Also, all three copies failed the verifyMeta test (hdfs debug verifyMeta -block blk_XXXXXX -meta blk_XXXXX_YYY.meta reported a checksum error).
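Concretely, what I compared on each of the three datanodes holding a replica was roughly this (run from the directory find pointed me at; names are shortened placeholders):

  ls -l blk_XXXXXXXX blk_XXXXXXXX_YYY.meta   # same size and creation time on every node
  md5sum blk_XXXXXXXX                        # different on every node
  md5sum blk_XXXXXXXX_YYY.meta               # identical on every node

and each replica then failed the verifyMeta check shown above.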

 

Intrigued, I scanned one node for more failing block files and found a bunch. I concentrated on one block (blk_1073744395), which belonged to file A. I tracked the block to three nodes; all three had different MD5 checksums for the block file and all three failed the verifyMeta test. However, fsck -blockId returned a HEALTHY status for all three replicas. I then decided to fetch the file from HDFS and did a -copyToLocal. The logs indicated that node1 threw a checksum error but node2 fulfilled the request correctly. The replica was then removed from node1 and re-replicated on node4, where again I found that it had a different MD5 and failed the verifyMeta test.
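The commands behind that paragraph were essentially these (the HDFS path for file A is made up for illustration):

  # ask the namenode about the block's replicas and their health
  hdfs fsck -blockId blk_1073744395

  # read the file back through the normal client path
  hdfs dfs -copyToLocal /data/bro/fileA.json /tmp/fileA.json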

 

My Questions are:

- Is it possible for the verifyMeta check to fail but the actual checksum verification (done when serving the block to a client) to pass on the datanode, as we saw?

- Should all replicas of a block have the same hash (say, MD5)?

- What may be causing finalized blocks to start failing checksum verification if the disk is healthy?

 

I would be grateful if someone could shed some light on the behaviour we are seeing on our datanodes.

1 ACCEPTED SOLUTION

Expert Contributor

First of all, CDH didn't support SAN until very recently, and even now the support is limited.

https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.h...

 

Warning: Running CDH on storage platforms other than direct-attached physical disks can provide suboptimal performance. Cloudera Enterprise and the majority of the Hadoop platform are optimized to provide high performance by distributing work across a cluster that can utilize data locality and fast local I/O. Refer to the Cloudera Enterprise Storage Device Acceptance Criteria Guide for more information about using non-local storage.
 
That said, I am interested in knowing more about your setup. What application writes those corrupt files? HDFS in CDH 5.15 is quite stable and most of the known data corruption bugs have been fixed. Probably the only bug whose fix is not in CDH 5.15 is HDFS-10240, which Flume on a busy cluster could trigger, but that symptom doesn't quite match your description anyway.
 

> - Is it possible for the verifyMeta check to fail but the actual checksum verification (done when serving the block to a client) to pass on the datanode, as we saw?

I won't say that's impossible, but we've not seen such a case. The verifyMeta implementation is actually quite simple.

 

> - Should all replicas of a block have the same hash (say, MD5)?

If all replicas have the same size, they are supposed to have the same checksum. (We support append, but not truncate.) If your SAN device is busy, there is a chance the HDFS client will give up writing to a DataNode, replace it in the write pipeline with a different DataNode, and continue from there. In that case, replicas may have different file lengths, because some of them are stale.
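If you want to rule out the stale-replica case, comparing the on-disk length of each replica is enough; for example, on each of the DataNodes holding the block (the data directory below is only an example, yours may differ):

  # print the size of this datanode's copy of the block file
  find /dfs/dn -name 'blk_1073744395' -exec ls -l {} \;

If all the lengths match, stale replicas left behind by a pipeline recovery are not the explanation.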

 

> - What may be causing finalized blocks to start failing checksum verification if the disk is healthy?

An underperforming disk or a busy DataNode could abort the write to that block. I can't give you a definitive answer because I don't have much experience with HDFS on SAN.


4 REPLIES 4


New Contributor
Thanks for your reply.

We use Logstash to ship Bro event logs to Hadoop. Earlier we used the WebHDFS interface, but after encountering a lot of bad blocks (we thought APPEND might be the cause) we switched to a simple hdfs dfs -put. Logstash writes hourly log files (JSON lines), we ship them to a gateway node and do a -put, so I don't think it's a client issue.
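For context, the hourly ingest from the gateway node is basically just this (the paths are simplified examples, not our real ones):

  # copy the hourly Bro/Logstash output into HDFS
  hdfs dfs -put conn.2018-10-01-13.json /data/bro/

  # spot-check the uploaded file and where its blocks landed
  hdfs fsck /data/bro/conn.2018-10-01-13.json -files -blocks -locations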

I am still a little surprised that a block that failed the verifyMeta test was deemed OK by HDFS and served.

I don't see any slow-write warnings in the logs, but I do see nodes in the write pipeline complaining about bad checksums (while writing) and giving up.
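Roughly how I am finding those messages, assuming the default CDH log location on the datanodes (adjust the path for your install):

  # look for checksum complaints from the write pipeline in the datanode log
  grep -i checksum /var/log/hadoop-hdfs/*DATANODE*.log*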

Expert Contributor
Thanks, I really appreciate you sharing that.

The WebHDFS append operation was prone to a corruption bug, HDFS-11160, but that was fixed in CDH 5.11.0.

> I don't see any slow-write warnings in the logs, but I do see nodes in the write pipeline complaining about bad checksums (while writing) and giving up.

That's an interesting observation. Checksum errors should be a very rare event, if they happen at all. Without further details, I would suspect the SAN has something to do with it. It's just such a rare setup in our customer install base that it's hard for me to tell what the effect would be.

New Contributor
Thank you for that insight. I will mark your original post as accepted and maybe update the post later if we have any new information to share.