12-31-2018
08:37 PM
Thank you for that insight. I will mark your original post as accepted and maybe update the post later if we have any new information to share.
12-29-2018
01:05 AM
Thanks for your reply. We use Logstash to ship Bro event logs to Hadoop. Earlier we used the WebHDFS interface, but after encountering a lot of bad blocks (we suspected APPEND might be the cause) we switched to a simple hdfs dfs -put. Logstash writes hourly log files (JSON lines), we ship them to a gateway node, and we do the -put from there, so I don't think it's a client issue. I am still a little surprised that a block that failed the -verifyMeta test was deemed OK by HDFS and served. I don't see any slow writes in the logs, but I do see nodes in the write pipeline complaining about bad checksums (while writing) and giving up.
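For context, here is a minimal sketch of that hourly ingest flow as shell commands; the paths and filenames are hypothetical stand-ins for our actual layout:

# Hourly JSON-lines file written by Logstash, shipped to the gateway node.
HOUR=$(date -d '1 hour ago' +%Y%m%d%H)
LOCAL="/var/log/bro/bro-${HOUR}.json"
DEST="/data/bro/"    # HDFS landing directory

# Plain put -- no WebHDFS APPEND anywhere in the path.
hdfs dfs -mkdir -p "$DEST"
hdfs dfs -put "$LOCAL" "$DEST"

# Sanity check: re-read from HDFS and compare hashes with the local copy
# to rule out corruption introduced in transit.
hdfs dfs -cat "${DEST}$(basename "$LOCAL")" | md5sum
md5sum "$LOCAL"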
12-28-2018
04:28 AM
We are trying to set up a Hadoop installation and are running CDH 5.15.1. We have recently noticed that a lot of blocks are flagged as "bad" due to checksum errors AFTER they have been finalized. Our installation is small: 8 datanodes, each with a single 50 TB SCSI disk (the disk is actually part of a SAN) on an ext4 filesystem, with a replication factor of 3.

I understand that a volume scanner runs on each datanode and checks the integrity of individual blocks by comparing the checksum stored in the meta file against a checksum computed from the actual block. I am also aware of the "hdfs debug -verifyMeta" command, which we can run on a datanode to perform the same comparison by hand.

After a few files were flagged as corrupt due to missing blocks, I picked one block and checked all the nodes where it lived. On each node, the block file had the same size and creation time but a different MD5 hash (obtained by running md5sum blk_XXXXXXXX). The meta files all had the same MD5 checksum, yet all three replicas failed the -verifyMeta test with a checksum error (hdfs debug -verifyMeta -block blk_XXXXXX -meta blk_XXXXX_YYY.meta threw the error).

Curious, I scanned one node for more failing blocks and found a bunch. I concentrated on one block (blk_1073744395) belonging to file A. I tracked the block to three nodes; all three had different MD5 checksums for the block file, and all three failed the -verifyMeta test (the exact sequence of commands is sketched at the end of this post). fsck -blockId returned a HEALTHY status for all three replicas. I then fetched the file from HDFS with -copyToLocal: the logs showed that node1 threw a checksum error but node2 fulfilled the request correctly. The replica was then removed from node1 and re-replicated onto node4, where I again found a different MD5 and another -verifyMeta failure.

My questions are:
- Is it possible for the -verifyMeta check to fail while the checksum verification done when serving the block to a client passes on the same datanode, as we saw?
- Should all replicas of a block have the same hash (say, MD5)?
- What could cause finalized blocks to start failing checksum verification if the disk is healthy?

I would be grateful if someone could shed some light on the behaviour we are seeing on our datanodes.
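For reference, here is the check sequence sketched as commands. The data directory /dfs/dn is an assumption (substitute your dfs.datanode.data.dir), and steps 2-4 run on each datanode that holds a replica:

BLOCK=blk_1073744395

# 1. Confirm which datanodes hold the replicas and what fsck reports.
hdfs fsck -blockId "$BLOCK"

# 2. Locate the on-disk block and meta files under the finalized tree.
find /dfs/dn -name "${BLOCK}*"

# 3. Hash the raw block bytes; finalized replicas of the same block
#    should hash identically on every node.
find /dfs/dn -name "$BLOCK" -exec md5sum {} \;

# 4. Re-verify the replica against its stored .meta checksums (note:
#    in stock Hadoop the debug subcommand takes no leading dash).
BLOCK_FILE=$(find /dfs/dn -name "$BLOCK" | head -n 1)
META_FILE=$(find /dfs/dn -name "${BLOCK}_*.meta" | head -n 1)
hdfs debug verifyMeta -block "$BLOCK_FILE" -meta "$META_FILE"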
Labels:
- HDFS