Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

CDH3: disk failure, datanode doesn't start even after disk replacement

Hi,

In our CDH3 cluster (hadoop-0.20.2, yes, it's pretty old 😉) a disk failed on one node, which took the datanode down.
After replacing the disk and recreating the directories and permissions, the datanode still fails to start with this error:

2014-04-15 16:14:43,165 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 5, volumes configured: 6, volumes failed: 1, volume failures tolerated: 0
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.<init>(FSDataset.java:1025)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:416)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:303)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1643)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1583)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1601)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1727)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1744)


How can I tell the datanode that the disk has been replaced, or how do I "enable" the replaced disk?
I don't want to set the number of tolerated volume failures to 1 just to be able to start the datanode 😉

br, Gerd

1 ACCEPTED SOLUTION

Hi,

The issue has been solved. The problem was a mix-up between directory permissions and ownership: the owner had been set to 700 instead of the permissions (stupid thing 😉).

Nevertheless, the error message is somewhat misleading; it would be better if it reported that the owner/permissions are incorrect.

Gerd
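
Since the datanode only reports "Too many failed volumes" in this situation, it can help to check owner and mode of every data directory by hand before starting it. The following is a minimal sketch, not part of the original post: the function name `check_data_dir`, the example paths and the `hdfs` user are assumptions for illustration, and the expected mode 700 is taken from the description above. Use the directories listed in `dfs.data.dir` and whatever owner and mode your cluster is actually configured for.

```python
import os
import pwd
import stat


def check_data_dir(path, expected_owner, expected_mode=0o700):
    """Return a list of human-readable problems found for one data directory."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"{path}: cannot stat ({exc.strerror})"]

    problems = []
    if not stat.S_ISDIR(st.st_mode):
        problems.append(f"{path}: not a directory")

    # Resolve the numeric UID to a user name. A UID without a passwd entry
    # (for example after a mistaken `chown 700`) is reported as the raw number.
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)
    if owner != expected_owner:
        problems.append(f"{path}: owner is {owner}, expected {expected_owner}")

    # Compare only the permission bits, not the file-type bits.
    mode = stat.S_IMODE(st.st_mode)
    if mode != expected_mode:
        problems.append(f"{path}: mode is {mode:03o}, expected {expected_mode:03o}")
    return problems


if __name__ == "__main__":
    # Hypothetical paths and user; replace with the values from dfs.data.dir.
    for data_dir in ["/data/1/dfs/dn", "/data/2/dfs/dn"]:
        for problem in check_data_dir(data_dir, expected_owner="hdfs"):
            print(problem)
```

A directory whose owner shows up as a bare number such as 700 is a strong hint that `chown` was used where `chmod` was intended.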
