Created 07-17-2022 09:48 PM
Hi everyone.
I am getting stale DataNode alerts every hour. The alert fires whenever the last contact between a DataNode and the NameNode is more than 30 s old.
I have not been able to find the root cause of this slowness. The system has 32 cores, and when the alert is generated htop shows elevated usage by the hdfs processes, but not all cores are at 100% utilization.
DataNode Health Summary
Created 07-18-2022 02:00 AM
Hello @KPG1 ,
The time after which the NameNode marks a DataNode as stale is given by dfs.namenode.stale.datanode.interval, which defaults to 30000 ms (30 seconds). If this is happening with one specific DataNode, check whether there are any network issues between that DataNode and the NameNode, and check the DataNode logs for reported JVM pauses. As a band-aid, you can increase this parameter until the underlying problem is solved.
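To make the JVM-pause check concrete, here is a minimal Python sketch that scans a DataNode log for the warning written by Hadoop's JvmPauseMonitor ("Detected pause in JVM or host machine ... pause of approximately N ms"). The function name `find_jvm_pauses` and the 10-second threshold are my own assumptions for illustration, and the log path depends on your deployment, so please verify the message format against your own logs before relying on it.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Message format written by Hadoop's JvmPauseMonitor.
# Assumption: your Hadoop version logs the pause in this form; check a real log line.
PAUSE_RE = re.compile(
    r"Detected pause in JVM or host machine.*?pause of approximately (\d+)ms"
)


def find_jvm_pauses(
    lines: Iterable[str], min_pause_ms: int = 10_000
) -> List[Tuple[int, str]]:
    """Return (pause_ms, log_line) for every reported pause >= min_pause_ms.

    min_pause_ms is a hypothetical threshold chosen for illustration; pauses
    that approach the 30 s stale interval are the ones most worth a look.
    """
    hits = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match and int(match.group(1)) >= min_pause_ms:
            hits.append((int(match.group(1)), line.rstrip("\n")))
    return hits


# Usage: python find_pauses.py /path/to/datanode.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log_file:
        for pause_ms, log_line in find_jvm_pauses(log_file):
            print(f"{pause_ms} ms: {log_line}")
```

If the timestamps of long pauses line up with the hourly alerts, that would point toward GC or host-level stalls on that DataNode rather than the network.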