CDH3: disk failure, datanode doesn't start even after disk replacement

Hi,

In our CDH3 cluster (hadoop-0.20.2, yes, it's pretty old 😉) a disk failed on one node, which took the datanode down.
After replacing the disk and setting up the directories and permissions, the datanode still fails to start with this error:

2014-04-15 16:14:43,165 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 5, volumes configured: 6, volumes failed: 1, volume failures tolerated: 0
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.<init>(FSDataset.java:1025)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:416)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:303)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1643)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1583)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1601)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1727)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1744)


How do I tell the datanode that the disk has been replaced, or how do I "enable" the replaced disk?
I don't want to set the tolerated volume failures to 1 just to be able to start the datanode 😉

br, Gerd

1 ACCEPTED SOLUTION


Hi,

 

The issue has been solved. The problem was a mix-up between directory permissions and ownership: the owner had been set to 700 rather than the permissions (stupid thing 😉).

Nevertheless, the error message is somewhat misleading; it would be better if it reported that the user/permissions are incorrect.

 

Gerd
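
Since the "Too many failed volumes" message does not say why a volume was rejected, a quick ownership and mode check on each data directory can help before restarting the datanode. The following is only a minimal sketch: the user name "hdfs", the expected mode 700 and the example paths are assumptions for illustration, so use whatever user runs your datanode and the directories and permissions configured for your cluster.

```python
import os
import pwd
import stat


def check_data_dir(path, expected_owner, expected_mode=0o700):
    """Return a list of human-readable problems found for one data directory."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"{path}: cannot stat ({exc.strerror})"]

    problems = []
    if not stat.S_ISDIR(st.st_mode):
        problems.append(f"{path}: not a directory")

    # Resolve the numeric uid to a user name. A uid without a passwd entry
    # (e.g. if 700 ended up as the owner instead of the mode) is reported as such.
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = f"uid {st.st_uid} (no such user)"
    if owner != expected_owner:
        problems.append(f"{path}: owner is {owner}, expected {expected_owner}")

    # Compare only the permission bits, not the file-type bits.
    mode = stat.S_IMODE(st.st_mode)
    if mode != expected_mode:
        problems.append(f"{path}: mode is {mode:o}, expected {expected_mode:o}")
    return problems


if __name__ == "__main__":
    # Hypothetical data directories and datanode user; adjust to your setup.
    for data_dir in ["/data/1/dfs/dn", "/data/2/dfs/dn"]:
        for problem in check_data_dir(data_dir, "hdfs"):
            print(problem)
```

Run it on the affected node; a directory that produces no output has the expected owner and mode, while any printed line points at the volume that needs a chown or chmod.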

