Created on 06-09-2018 10:44 PM - edited 09-16-2022 06:19 AM
My resource managers are active, and so is the Job History Server. The node managers on all my worker nodes had been exiting randomly for some time, but they used to restart automatically. Today, all my node managers are down. What could be the reason? My worker nodes are typical, with HDFS and YARN on them, and HDFS is running fine. What does it indicate when all node managers are down? There was no unusual load on the servers. Also, if I restart them, they still go down. Please suggest what could cause this.
Created 06-12-2018 11:29 PM
Check the network connection between the node managers and Cloudera Manager; this could be a network issue. Try a 100 MB file transfer between the troubled hosts and the healthy hosts, and compare the transfer times.
If the file transfer between nodes (the ones with the node manager down) takes longer than expected, contact your network team to check the network switch connecting those nodes.
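One way to run that comparison without extra tooling is a small Python script. This is only a rough sketch under my own assumptions: the helper names `run_sink` and `timed_send` and the port number are made up for illustration and are not part of Cloudera Manager or Hadoop. Start the sink on the receiving host, then send 100 MB from a troubled host and from a healthy host and compare the elapsed times.

```python
import socket
import threading
import time

CHUNK = 1024 * 1024  # 1 MiB per send


def run_sink(host="0.0.0.0", port=0):
    """Start a throwaway TCP server that discards incoming data.

    Hypothetical helper: after each client finishes sending, it replies
    with the number of bytes received. Returns (server_socket, bound_port).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)

    def serve():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # server socket was closed
            with conn:
                total = 0
                while True:
                    data = conn.recv(CHUNK)
                    if not data:
                        break
                    total += len(data)
                conn.sendall(str(total).encode())

    threading.Thread(target=serve, daemon=True).start()
    return srv, srv.getsockname()[1]


def timed_send(host, port, size_mb=100):
    """Send size_mb MiB of zeros to host:port.

    Returns (bytes_acknowledged_by_sink, elapsed_seconds).
    """
    payload = b"\0" * CHUNK
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=30) as s:
        for _ in range(size_mb):
            s.sendall(payload)
        s.shutdown(socket.SHUT_WR)  # signal end of data to the sink
        ack = b""
        while True:
            part = s.recv(64)
            if not part:
                break
            ack += part
    elapsed = time.monotonic() - start
    return int(ack), elapsed


if __name__ == "__main__":
    # Loopback demo; on a real cluster, run run_sink() on the receiving
    # host (port 50999 is an arbitrary example) and call timed_send()
    # from each host you want to compare.
    srv, port = run_sink("127.0.0.1")
    received, seconds = timed_send("127.0.0.1", port, size_mb=100)
    print(f"sent {received / CHUNK:.0f} MiB in {seconds:.2f} s")
```

If the troubled hosts are consistently much slower than the healthy ones for the same payload, that points toward the switch or cabling rather than YARN itself.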
Created 06-12-2018 11:53 PM
But in that case, HDFS would be down as well, no? HDFS is installed on the same servers as the node managers, and HDFS is working fine without any warnings or errors.
Created 06-13-2018 12:01 AM
Also, the node manager continues to exit even when it is on the same node as CM.
Created 06-13-2018 12:19 AM
I faced a similar issue with 'impalad', where the cause turned out to be a problem with the network switch.
I suggest it's worth trying.