Created on 02-21-2026 03:01 AM - last edited on 04-20-2026 11:10 PM by GrazittiAPI
I'm experiencing a very slow HDFS start in CDP 7.1.7 SP1 on a cluster with a huge number of blocks (over 300 million in total, with some servers holding up to 40 million each).
I've checked this:
https://community.cloudera.com/t5/Community-Articles/Scaling-the-HDFS-NameNode-part-5/ta-p/327450
and I wonder whether setting dfs.blockreport.split.threshold to 0 might somehow speed up the process.
I've seen that the setting should go in the NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
Is this setting service-wide, so that a full restart is necessary?
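For clarity, this is the kind of entry I was planning to add to the safety valve (just a sketch of the property, not applied yet):

<property>
  <name>dfs.blockreport.split.threshold</name>
  <value>0</value>
</property>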
Created 02-22-2026 10:09 PM
Hello @ganzuoni ,
Thanks for reaching out; here are my answers:
With your current configuration you have reached the limit: Cloudera does not recommend more than 300M blocks in total, and no more than 10M blocks per DataNode, whereas you have 40M blocks per DataNode; this is a dead end.
Setting dfs.blockreport.split.threshold to 0 is a good plan, but first confirm that this is actually what is causing the slowness by checking for "Block report queue is full" messages in the NameNode logs.
Moreover, please check the NN logs for read-lock-held and write-lock-held messages; if any exceed 10 seconds, read the thread where it is stuck. Also check:
1. Whether any snapshot policy is running
2. Whether the Balancer is running
3. Which user is doing which operation the most (you will find this in the NN audit logs)
4. Any pauses in the NN and DN logs; we recommend 1GB of heap per 1M blocks
Suppose in the audit logs you find that one user is issuing thousands of getfileinfo RPCs, far more than any other user; then try stopping that job for a while to confirm whether everything else speeds up.
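For example, a rough way to tally which user and operation dominate the audit log is a small script along these lines (a sketch; the audit-log path is an assumption, adjust it for your cluster):

# Count (user, operation) pairs in the HDFS audit log; the path is an assumption.
from collections import Counter
import re

counts = Counter()
with open("/var/log/hadoop-hdfs/hdfs-audit.log") as f:
    for line in f:
        ugi = re.search(r"ugi=(\S+)", line)
        cmd = re.search(r"cmd=(\S+)", line)
        if ugi and cmd:
            counts[(ugi.group(1), cmd.group(1))] += 1

# Print the 20 heaviest (user, operation) combinations
for (user, cmd), n in counts.most_common(20):
    print(f"{n:>8}  {user:<30} {cmd}")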
Created 02-23-2026 12:49 AM
Hello @Asfahan
Thank you for the answer. Yes, I understand that the cluster is a little over the recommended limits.
On the topic, I don't find any "Block report queue full" message, but I do see several write-locks held for long durations, though, strangely enough, not during the HDFS service startup.
What I do find is a stream of requests coming in via the NFS Gateway (around 3000/minute), plus several GC (Allocation Failure) entries in the GC log during the first 20 minutes of startup and several more near the end, when all the DataNodes had reported their blocks.
The NN has 160GB of heap and the DNs 30GB.
What I found strange is dfs_datanode_handler_count set to 3; that might be the cause of the original issue that forced me to restart the service.
In fact, I was decommissioning one node and, when I started it, I suddenly experienced a huge performance degradation, even though network, HDFS and disk I/O were not that critical
(cluster net I/O peak was 280 MB/s, HDFS I/O 190 MB/s, disk I/O write peak 300 MB/s).
Created 02-23-2026 02:04 AM
Thanks @ganzuoni
For the current size, 160GB of NN heap for 300M blocks is far too little, and you will see this type of GC allocation failure in the cluster; please increase it to 300-320GB, and the DN heap to at least 40GB.
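To make the sizing concrete, the 1GB-per-1M-blocks rule of thumb mentioned earlier works out roughly like this (just a sketch of the arithmetic, not an official formula):

# Rough NameNode heap estimate from the ~1GB-per-1M-blocks rule of thumb above
blocks = 300_000_000          # total blocks in the cluster
heap_gb = blocks / 1_000_000  # ~1GB of heap per million blocks
print(f"Suggested NameNode heap: ~{heap_gb:.0f} GB")  # ~300 GB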
The handler count has its own calculation; there is a blog that covers it.
Your cluster is already over-utilised, without enough resources; the moment you decommission a DN, it starts re-replicating its blocks, which is again a huge bandwidth-heavy job, causing a lot of the performance issues.
First we need to fix the cluster with enough resources.
Can you give us the complete thread of a write-lock-held message? Also, did you find anything in the audit logs?
Created 02-23-2026 02:31 AM
Hi @Asfahan
Yes, the heap should be around 300GB, but this is what the NN reports on the web UI:
Heap Memory used 111.53 GB of 169.41 GB Heap Memory. Max Heap Memory is 169.41 GB.
As for the handlers, dfs_namenode_handler_count is 70 (it should be 80 with 17 DataNodes), while dfs_datanode_handler_count is at its default value of 3.
On a different cluster I had this set to 24.
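For reference, the 80 above comes from the usual 20 * log2(number of DataNodes) rule of thumb (an assumption on my part; check the tuning guide you follow for the exact formula):

# Handler-count rule of thumb: 20 * log2(number of DataNodes)
import math

datanodes = 17
print(int(20 * math.log2(datanodes)))  # 81, i.e. roughly the 80 mentioned above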
This is the stack trace for a write-lock held in the active NN:
2026-02-20 11:01:44,596 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of suppressed write-lock reports: 0
Longest write-lock held at 1972-02-11 21:18:16,333+0100 for 6157ms via java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:262)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:226)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1696)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.processBlocksInternal(DatanodeAdminManager.java:703)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.pruneReliableBlocks(DatanodeAdminManager.java:644)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:572)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:506)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
Total suppressed write-lock held time: 0.0
Created 02-23-2026 02:40 AM
On the DataNodes, the typical stack traces were these:
2026-02-20 12:01:41,486 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Lock held time above threshold: lock identifier: FsDatasetRWLock lockHeldTimeMs=8582 ms. Suppressed 0 lock warnings. Longest suppressed LockHeldTimeMs=0. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058)
org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:160)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:220)
org.apache.hadoop.util.InstrumentedReadLock.unlock(InstrumentedReadLock.java:78)
org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1920)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:376)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:719)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:872)
java.lang.Thread.run(Thread.java:748)
2026-02-20 12:01:41,486 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Waited above threshold to acquire lock: lock identifier: FsDatasetRWLock waitTimeMs=7442 ms. Suppressed 3 lock wait warnings. Longest suppressed WaitTimeMs=414. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058)
org.apache.hadoop.util.InstrumentedLock.logWaitWarning(InstrumentedLock.java:171)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:222)
org.apache.hadoop.util.InstrumentedLock.lock(InstrumentedLock.java:105)
org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1646)
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1303)
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:762)
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:178)
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:112)
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
java.lang.Thread.run(Thread.java:748)
2026-02-20 11:06:02,845 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Waited above threshold to acquire lock: lock identifier: FsDatasetRWLock waitTimeMs=688 ms. Suppressed 5 lock wait warnings. Longest suppressed WaitTimeMs=397. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058)
org.apache.hadoop.util.InstrumentedLock.logWaitWarning(InstrumentedLock.java:171)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:222)
org.apache.hadoop.util.InstrumentedLock.lock(InstrumentedLock.java:105)
org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1750)
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:997)
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:899)
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:178)
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:112)
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
java.lang.Thread.run(Thread.java:748)
and this
2026-02-20 11:11:44,500 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Waited above threshold to acquire lock: lock identifier: FsDatasetRWLock waitTimeMs=443 ms. Suppressed 1 lock wait warnings. Longest suppressed WaitTimeMs=412. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058)
org.apache.hadoop.util.InstrumentedLock.logWaitWarning(InstrumentedLock.java:171)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:222)
org.apache.hadoop.util.InstrumentedLock.lock(InstrumentedLock.java:105)
org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaMap.get(ReplicaMap.java:115)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.validateBlockFile(FsDatasetImpl.java:2036)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReplica(FsDatasetImpl.java:808)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReplica(FsDatasetImpl.java:801)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getLength(FsDatasetImpl.java:794)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkBlock(FsDatasetImpl.java:1988)
org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlock(DataNode.java:2315)
org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlocks(DataNode.java:2372)
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:726)
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:684)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1334)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1380)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1307)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1290)
Created 02-23-2026 03:55 AM
The locks held above are in the 6-8 second range; that will not cause the slowness. Also, the above comes from the service RPC path while blocks are being reported to the NN.
Check for any lock held longer than 10-15 seconds.
Heap utilisation and heap required are two completely different things: to keep 300M blocks you require around 300GB, while the utilisation you see reflects the jobs currently running. Please review the doc below:
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade/topics/cdpdc-hdfs.html
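To make that 10-15 second check concrete, a small scan of the NameNode log for long lock-hold warnings could look like this (a sketch; the log path and the exact message wording are assumptions, adjust them for your cluster):

# Print NameNode lock-held warnings above a threshold; path and message format are assumptions.
import re
import sys

THRESHOLD_MS = 10_000
pattern = re.compile(r"(?:read|write)-lock held .*?for (\d+)ms")

log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/hadoop-hdfs/NAMENODE.log.out"
with open(log_path) as f:
    for line in f:
        m = pattern.search(line)
        if m and int(m.group(1)) >= THRESHOLD_MS:
            print(line.rstrip())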