DataNode stopped and not starting now with - Failed to add storage for block pool

New Contributor

One DataNode went down, and when we try to start it back up it fails with the following errors:

WARN common.Storage (DataStorage.java:addStorageLocations(399)) - Failed to add storage for block pool: BP-441779837-135.208.32.109-1458040734038 : BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /opt/app/data11/hadoop/hdfs/data/current/BP-441779837-135.208.32.109-1458040734038

FATAL datanode.DataNode (BPServiceActor.java:run(878)) - Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to <HOST/IP>:8020. Exiting.

java.io.IOException: All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1336)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1301)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:866)
        at java.lang.Thread.run(Thread.java:745)

FATAL datanode.DataNode (BPServiceActor.java:run(878)) - Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to <HOST/IP>:8020. Exiting.

org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 10, volumes configured: 11, volumes failed: 1, volume failures tolerated: 0
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:261)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1349)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1301)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:866)
        at java.lang.Thread.run(Thread.java:745)

WARN datanode.DataNode (BPServiceActor.java:run(899)) - Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to <HOST/IP>:8020

WARN datanode.DataNode (BPServiceActor.java:run(899)) - Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to <HOST/IP>:8020

INFO datanode.DataNode (BlockPoolManager.java:remove(103)) - Removed Block pool <registering> (Datanode Uuid unassigned)

WARN datanode.DataNode (DataNode.java:secureMain(2417)) - Exiting Datanode

INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0

INFO datanode.DataNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:

3 REPLIES

Super Collaborator

The message can be caused by a process that is still (or already) accessing the directory. Check whether this is the case with:

lsof | grep /opt/app/data11/hadoop/hdfs/data/current/BP-441779837-135.208.32.109-1458040734038

The first three columns are:

  • command
  • process id
  • user

If there is a process locking the file, this should help you to identify it.
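For illustration, output along these lines (the process name, PID, and remainder of the line are hypothetical here) would indicate a lingering DataNode JVM still holding the block pool directory:

java      24135    hdfs    ...    /opt/app/data11/hadoop/hdfs/data/current/BP-441779837-135.208.32.109-1458040734038/...

If such a process shows up, stop it (or kill the stale PID) before trying to start the DataNode again.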

Super Collaborator

One question: have you performed an upgrade of HDFS?
You may also want to check with:

hdfs fsck / -includeSnapshots

New Contributor

Thanks, Harald, for your input.

While investigating further, we found that one disk on this DataNode host was not healthy (it had gone read-only). After replacing the disk, the issue was resolved. The volume-failure tolerance on the cluster was set to 0, which is why the DataNode would not start with even a single failed volume.
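For reference, that tolerance is controlled by the dfs.datanode.failed.volumes.tolerated property in hdfs-site.xml (set through Ambari on managed clusters). A minimal sketch, assuming you are willing to let a DataNode keep serving with one failed data directory, would be:

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>

The DataNode must be restarted for the change to take effect. With the default of 0, a single bad volume prevents startup, which matches the DiskErrorException in the log above.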

We didn't perform any upgrade recently.