Tasks attempting to read from failed hard drive

Explorer

Over a week ago a drive failed in one of our DataNodes, and we have not replaced it yet. Today, while a user was running a get command to copy a file out of HDFS, he got an error because the client was trying to read a block off the failed drive. The client then successfully connected to a different DataNode and finished the copy. My question is: why was the NameNode directing the read to a node that should no longer be listed as holding that block, given the failed drive?

$ ll /data/hadoop-data/10
ls: cannot access /data/hadoop-data/10: Input/output error

 

$ hadoop fs -get /data/discovery/<redacted> /home/m9tn/<redacted>
15/05/29 14:21:55 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, self=/<ipaddress redacted>, remote=/<ipaddress redacted>, for file /data/discovery/<redacted>, for pool BP-1211057805-<ipaddress redacted>-1411162331514 block 1076062177_2488147
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:445)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:410)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:785)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:663)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:327)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:574)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:797)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:84)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:456)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:382)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:319)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:254)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:239)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:306)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:234)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:211)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
15/05/29 14:21:55 WARN hdfs.DFSClient: Failed to connect to /<ipaddress redacted> for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/<ipaddress redacted>, remote=/<ipaddress redacted>, for file /data/discovery/<redacted>, for pool BP-1211057805-<ipaddress redacted>-1411162331514 block 1076062177_2488147
java.io.IOException: Got error for OP_READ_BLOCK, self=/<ipaddress redacted>, remote=/<ipaddress redacted>, for file /data/discovery/<redacted>, for pool BP-1211057805-10.96.243.40-1411162331514 block 1076062177_2488147
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:445)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:410)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:785)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:663)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:327)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:574)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:797)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:84)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:456)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:382)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:319)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:254)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:239)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:306)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:234)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:211)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
15/05/29 14:21:55 INFO hdfs.DFSClient: Successfully connected to /<ipaddress redacted> for BP-1211057805-<ipaddress redacted>-1411162331514:blk_1076062177_2488147
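One way to see which replicas the NameNode is still handing out for this file is to run fsck with block locations; a minimal sketch, assuming the same redacted path as in the get command above:

$ hdfs fsck /data/discovery/<redacted> -files -blocks -locations

If the node with the failed drive still appears in the -locations output, the NameNode has not yet received a block report that excludes the dead volume, which would explain why the client was sent there first before falling back to another replica.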

3 Replies

Re: Tasks attempting to read from failed hard drive

Explorer

Also, Cloudera Manager showed the drive failure right away:

  • The health test result for DATA_NODE_VOLUME_FAILURES has become concerning: The DataNode has 1 volume failure(s). Warning threshold: 1 volume(s).
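The failed-volume count behind this health test can also be read straight from the DataNode; a minimal sketch, assuming the DataNode's HTTP port is the CDH5 default of 50075 and that its JMX servlet is reachable:

$ curl 'http://<datanode host>:50075/jmx?qry=Hadoop:service=DataNode,name=FSDatasetState*'

The FSDatasetState bean should report a NumFailedVolumes counter, which is a quick way to confirm that the DataNode itself has noticed the dead volume, independent of Cloudera Manager.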

Re: Tasks attempting to read from failed hard drive

What was the time difference between the disk dying and the file fetch being issued?
Has the DataNode in question been restarted since the disk died?
Regards,
Gautam Gopalakrishnan

Re: Tasks attempting to read from failed hard drive

Explorer

The drive failed on 5/24 and the fetch was issued on 6/1, so eight days.

 

I did restart the DataNode yesterday, after the failed fetch, and when it came back up Cloudera Manager flagged the failed drive. I then went back through the health history and noticed that it had flagged the failed drive on 5/24. Also, fsck came back clean both before and after the restart: 0 bad blocks and 0 under-replicated blocks.
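As a cross-check on the fsck result, the NameNode's own replication counters can be read with dfsadmin; a minimal sketch (the grep pattern is only illustrative):

$ hdfs dfsadmin -report | grep -iE 'under replicated|corrupt|missing'

If those counters are all zero, the NameNode currently believes every block has a full replica set, either because it never learned that replicas were lost on the failed volume or because re-replication has already caught up after the restart.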

 

Two different people ran the same get command and both got the same error on the same node.