Hello HCC,
We recently upgraded our prod and all dev clusters from HDP 2.5.3.0 to HDP 2.6.1.0. Since the upgrade we are observing odd behavior: some nodes are allocated a very high number of containers, which drives the load average on those servers very high and causes the nodes to go into heartbeat lost state.
When a node's load average gets very high, the NN marks that node as dead, whereas the RM keeps assigning containers to it (we know that the RM and NN work independently), and all of those containers cause jobs to fail. Every time this happens we ask our SA team to reboot the affected servers to alleviate the issue. We did not see this behavior with HDP 2.5.3.0.
Please find the attached screenshots for reference, showing nodes with a very high number of containers and a very high load average.
Present versions: HDP 2.6.1.0 and Ambari 2.5.2.0
@Kuldeep Kulkarni @Jay SenSharma @Artem Ervits @ssathish
ss-1.png ss-2.png ss-3.png ss-4.png