06-09-2018 10:44 PM
My ResourceManagers are active, and so is the JobHistory Server. All my worker nodes had been exiting randomly for some time, but they used to restart automatically. Today, all my NodeManagers are down. What could be the reason? My worker nodes are typical, with HDFS and YARN on them. HDFS is running fine. What does it indicate when all NodeManagers are down? There was no unusual load on the servers. Also, if I restart them, they still go down. Please suggest what could cause this.
06-12-2018 11:29 PM
Check the network connection between the NodeManager hosts and Cloudera Manager; this could be a network issue. Try a 100 MB file transfer between the troubled hosts and the healthy hosts, and compare the transfer times.
If the file transfer involving the nodes with the NodeManager down takes noticeably longer than the same transfer between healthy nodes, contact your network team to check the network switch connecting those nodes.
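
If copying a real file is inconvenient, the comparison above can be approximated with a small script. This is only a rough sketch, not a Cloudera tool: the function names `start_sink` and `timed_send` are made up for illustration, it pushes 100 MB of zeros over a plain TCP socket instead of copying a file, and it assumes the chosen port is reachable between the hosts.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes written per send call


def start_sink(bind_host="0.0.0.0", port=0):
    """Listen on a TCP port and discard everything one client sends.

    Hypothetical helper. Returns (port, thread); port=0 lets the OS pick
    a free port. The thread ends once the client has disconnected.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((bind_host, port))
    server.listen(1)

    def drain():
        conn, _ = server.accept()
        with conn:
            # recv returns b"" once the sender has finished and shut down.
            while conn.recv(CHUNK):
                pass
        server.close()

    thread = threading.Thread(target=drain, daemon=True)
    thread.start()
    return server.getsockname()[1], thread


def timed_send(host, port, total_bytes=100 * 1024 * 1024):
    """Send total_bytes of zeros to host:port; return (seconds, MB per second)."""
    payload = bytes(CHUNK)
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=30) as sock:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
        # Signal end of data, then wait until the sink closes its side,
        # so the timing covers delivery and not just local buffering.
        sock.shutdown(socket.SHUT_WR)
        sock.recv(1)
    elapsed = time.perf_counter() - start
    return elapsed, (total_bytes / (1024 * 1024)) / elapsed
```

One way to use it: call `start_sink()` on a healthy host and note the port it returns, then call `timed_send("<healthy-host>", port)` from a troubled host and again from another healthy host. A much lower MB/s figure from the troubled host would point toward the network path rather than YARN itself.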