DataNode stopped and not starting now with - Failed to add storage for block pool

New Contributor

One DataNode went down, and when we try to start it, it fails with the following errors:

WARN common.Storage (DataStorage.java:addStorageLocations(399)) - Failed to add storage for block pool: BP-441779837-135.208.32.109-1458040734038 : BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /opt/app/data11/hadoop/hdfs/data/current/BP-441779837-135.208.32.109-1458040734038

FATAL datanode.DataNode (BPServiceActor.java:run(878)) - Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to <HOST/IP>:8020. Exiting.

java.io.IOException: All specified directories are failed to load. at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478) at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1336) at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1301) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:866) at java.lang.Thread.run(Thread.java:745)

FATAL datanode.DataNode (BPServiceActor.java:run(878)) - Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to <HOST/IP>:8020. Exiting.

org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 10, volumes configured: 11, volumes failed: 1, volume failures tolerated: 0 at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:261) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30) at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1349) at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1301) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:866) at java.lang.Thread.run(Thread.java:745)

WARN datanode.DataNode (BPServiceActor.java:run(899)) - Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to <HOST/IP>:8020

WARN datanode.DataNode (BPServiceActor.java:run(899)) - Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to <HOST/IP>:8020

INFO datanode.DataNode (BlockPoolManager.java:remove(103)) - Removed Block pool <registering> (Datanode Uuid unassigned)

WARN datanode.DataNode (DataNode.java:secureMain(2417)) - Exiting Datanode

INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0

INFO datanode.DataNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:

3 REPLIES

Super Collaborator

The message could be caused by a process that is still (or already) accessing the file. Check whether this is the case with:

lsof | grep /opt/app/data11/hadoop/hdfs/data/current/BP-441779837-135.208.32.109-1458040734038

The first three columns are:

  • command
  • process id
  • user

If there is a process locking the file, this should help you identify it.
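
For illustration, a match might look like this (the process name, PID, and sizes here are hypothetical):

java      4213   hdfs   cwd   DIR   8,17   4096   2   /opt/app/data11/hadoop/hdfs/data/current/BP-441779837-135.208.32.109-1458040734038

In that case a JVM (command java, PID 4213, user hdfs) would still be holding the directory, and stopping that process should release the lock.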

Super Collaborator

One question: have you performed an upgrade of HDFS?
You may also want to check with:

hdfs fsck / -includeSnapshots
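
On a consistent filesystem, the fsck report should end with a healthy summary, roughly like this (abridged; figures are placeholders):

 Total size:    ... B
 Total blocks (validated):    ...
 Corrupt blocks:              0
 Missing replicas:            0
The filesystem under path '/' is HEALTHY

Any CORRUPT or MISSING entries in the report would point at blocks affected by the failed volume.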

New Contributor

Thanks, Harald, for your input.

While investigating further, we found that one disk on this DataNode host was not healthy (it had gone read-only). After replacing the disk, the issue was resolved. The volume-failure tolerance on the cluster was also set to 0, which is why this DataNode would not come up with a single failed disk.

We didn't perform any upgrade recently.
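
For anyone hitting the same symptom, two things worth checking (the command and value below are illustrative, not from our cluster): a read-only remount shows up directly in the mount table, e.g.

awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2}' /proc/mounts

and the tolerance from the "volume failures tolerated: 0" error is controlled by dfs.datanode.failed.volumes.tolerated in hdfs-site.xml (default 0, i.e. one failed disk stops the DataNode):

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>

Raising the tolerance lets the DataNode start with a failed volume, at the cost of running with reduced capacity until the disk is replaced.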
