We have a problem with NodeManager timeouts in our HDP 3.1.4 cluster (YARN 3.1.1, Spark 3; the cluster is Kerberized and managed with Ambari).
After restarting the NodeManager services the problem goes away for a while (about 1.5 weeks) and then comes back.
Most timeouts are observed when the cluster is under heavy load.
We don't see any unusual events in the NodeManager logs.
If we turn off the HBase service on the cluster, the problem still comes back, but only after 2 or 3 weeks.
When the timeout happens, the service is still listening the whole time on the node where this happens (I checked it with: netstat -tulpn | grep 8042).
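Since netstat only shows that the socket is in LISTEN state, it may be worth checking whether the NodeManager actually answers HTTP on that port, not just accepts TCP connections. A minimal probe sketch, assuming it is run on (or with network access to) the affected node; /ws/v1/node/info is the standard NodeManager REST endpoint, and on a Kerberized cluster the web UI may additionally require SPNEGO authentication, which is omitted here:

```shell
#!/usr/bin/env bash
# Probe the NodeManager web UI at HTTP level, not just TCP level.
# Assumption: run on the affected node; with SPNEGO-protected UIs
# you may need: curl --negotiate -u : ...
probe_nm() {
  local host="$1" port="$2"
  # -f makes curl fail on HTTP error codes; --max-time bounds the
  # wait, similar to what the Ambari alert check does.
  curl -sf --max-time 10 "http://${host}:${port}/ws/v1/node/info" > /dev/null
}

if probe_nm localhost 8042; then
  echo "NodeManager web UI responds"
else
  echo "port may be listening, but HTTP did not answer (curl exit $?)"
fi

# If HTTP does not answer while netstat still shows LISTEN, a thread
# dump of the NodeManager JVM can show what its webapp threads are
# stuck on (e.g. long GC pauses or exhausted Jetty threads):
#   jstack "$(pgrep -f NodeManager)" > /tmp/nm_threads.txt
```

Comparing such a probe with the netstat output during an incident would tell you whether the process is hung (listening but not responding) or the timeouts happen elsewhere on the network path.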
The problem shows up in the Ambari alerts: the NodeManager Web UI and NodeManager Health alerts are triggered (connection timeout to port 8042 on the affected node).
The number of these timeouts grows as time passes, and when there are many of them our tasks on the cluster fail more often.
What can we do about this problem?