01-04-2016 12:24 PM
I'm running Hadoop 2.6.0-cdh5.4.5, using a Flume sink to write to HDFS periodically. This works well, but occasionally I see missing blocks. I've gone through the FAQ and the likely culprits without any luck. Can anyone give pointers on what else I might look into?
In this latest situation I have one missing block with replication=3. I've given the highlights from two pertinent logs below - the other datanodes look pretty much like this one. It's as though the file was never written to disk and the DirectoryScanner cleans it out of memory, wiping away its existence. But that would be pretty odd, since it works 99.9% of the time. I've checked /var/log/messages around the same time on all nodes and observe no anomalies. I'll note that another Flume agent was writing successfully at the same time as well.
Any ideas are greatly appreciated.
NameNode log:
Day 1: 06:02 - the file completes
Day 1: 06:02 - the blockMap updated with the three replicas as UNDER_CONSTRUCTION
Day 1: 06:02 - the file is closed
Day 1: 06:02 - the FSNamesystem claiming there are no corrupt file blocks
Day 2: 07:25 - BlockStateChanged processReport...
Day 2: 07:25 - ask node 1 to replicate blk123 to node 2
Day 2: 07:25 - Error report DataNodeRegistration(node1 can't send invalid block 123)
DataNode log:
Day 1: 06:02 - PacketResponder blk123 type=HAS_DOWNSTREAM_IN_PIPELINE is terminating
Day 2: 04:51 - FSDatasetImpl: Removed block 123 from memory with missing block file on the disk.
Day 2: 04:51 - FSDatasetImpl: Deleted a metadata file for the deleted block ...
Day 2: 07:25 - Can't send invalid block blk123
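Since FSDatasetImpl reports the block file missing on disk, one thing worth checking on each datanode is whether the block file ever landed under the data directories at all. A minimal sketch - the find_block helper, the /dfs/dn path, and the block ID shown are illustrative assumptions, not taken from the logs above:

```shell
# Sketch: search a datanode's data dirs for a given block's files.
# find_block, the example path /dfs/dn, and the example block ID are
# assumptions; substitute your dfs.datanode.data.dir values and the
# real block ID from the NameNode log.
find_block() {
  blk="$1"; shift
  # Block files are named blk_<id>; their checksum files are
  # blk_<id>_<genstamp>.meta. Match both with a prefix glob.
  find "$@" -type f -name "${blk}*" 2>/dev/null
}

# On each datanode you would run something like:
#   find_block blk_1073741825 /dfs/dn
```

If the block file is absent on all three replicas' disks but the .meta file was present (as the "Deleted a metadata file for the deleted block" line suggests), that points at the block file being removed or never flushed, rather than a NameNode bookkeeping problem.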
02-09-2016 01:06 PM
1. I think you should try to identify which file is missing:
hadoop fsck /
2. Are all your DNs up and running?
I know that you can get corrupt blocks if you are writing very small files (when Flume opens a file and then has to close it with no data written).
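To narrow the fsck run down to just the affected paths, you can filter its per-block report lines. A minimal sketch - the corrupt_paths helper and the sample line format in the comment are my assumptions for illustration, not from this thread:

```shell
# Sketch: reduce fsck output to the unique paths of files with
# CORRUPT/MISSING blocks. The corrupt_paths name and the assumed
# line format ("<path>: CORRUPT ..." / "<path>: MISSING ...") are
# illustrative assumptions.
corrupt_paths() {
  # Keep only report lines that start with an HDFS path, take the
  # path before the first colon, and de-duplicate.
  grep -E '^/.*: (CORRUPT|MISSING)' | cut -d: -f1 | sort -u
}

# On a live cluster (hdfs CLI on PATH) you would feed it real output:
#   hadoop fsck / -files -blocks | corrupt_paths
```

On 2.6.x, fsck can also list corrupt files directly with `hadoop fsck / -list-corruptfileblocks`, which may be quicker than scanning the full report.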
02-10-2016 03:41 AM