We recently upgraded our prod and all dev clusters from HDP 22.214.171.124 to HDP 126.96.36.199, and since then we are observing odd behavior in HDP 188.8.131.52: some of the nodes are getting a very high allocation of containers, which causes a very high load average on those servers and in turn causes the nodes to go into heartbeat-lost state.
When a node's load average gets very high, the NN marks that node as dead, whereas the RM still keeps assigning containers to it (we know that the RM and NN work independently), and all of those containers cause jobs to fail. Every time we hit this issue we ask our SA team to reboot the affected servers to alleviate it. We did not have this behavior with HDP 184.108.40.206.
Please find the screenshot for reference, which shows the nodes that got a very high number of containers and a very high load average.
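In case it helps anyone compare against their own cluster, here is a rough sketch of how the per-node container counts could be pulled from the ResourceManager REST endpoint `/ws/v1/cluster/nodes` instead of the UI. The RM address and the threshold of 40 containers are placeholders I made up for illustration, not values from our cluster, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Hypothetical RM address; replace with your ResourceManager host and port.
RM_URL = "http://resourcemanager.example.com:8088"


def fetch_nodes(rm_url=RM_URL):
    """Return the node list from the RM REST endpoint /ws/v1/cluster/nodes."""
    with urlopen(rm_url + "/ws/v1/cluster/nodes", timeout=10) as resp:
        payload = json.load(resp)
    # The response wraps the list as {"nodes": {"node": [...]}}.
    return (payload.get("nodes") or {}).get("node", [])


def overloaded_nodes(nodes, max_containers):
    """Return (host, state, containers) for nodes above max_containers, busiest first."""
    rows = [
        (n.get("nodeHostName"), n.get("state"), int(n.get("numContainers", 0)))
        for n in nodes
    ]
    rows = [r for r in rows if r[2] > max_containers]
    return sorted(rows, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    # The threshold of 40 is an arbitrary assumption; tune it to your node size.
    for host, state, count in overloaded_nodes(fetch_nodes(), max_containers=40):
        print(host, state, count)
```

This only reports what the RM believes about each node; it does not show the NN's view of the same hosts, so the two would still need to be compared by hand.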
Present versions: HDP 220.127.116.11 and Ambari 18.104.22.168
@Kuldeep Kulkarni @Jay SenSharma @Artem Ervits @ssathish