My ResourceManagers are active and so is the Job History Server. The NodeManagers on all my worker nodes had been exiting randomly for some time, but they used to restart automatically. Today, all my NodeManagers are down. What could be the reason? My worker nodes are typical, with HDFS and YARN on them, and HDFS is running fine. What does it indicate when all NodeManagers are down? There was no unusual load on the servers. Also, if I restart them, they go down again. Please suggest what could cause this.
I have 9 worker nodes. These only have HDFS and NodeManagers installed on them. These shutdowns are the result of continuous exits by the NodeManager; however, I am not able to understand why my NodeManagers keep exiting. Any help would be really great. They run into unexpected exits even when there are only a handful of jobs running, and it keeps happening throughout the day. I have tried looking through the logs but I am not seeing any errors there. @Geoffrey Shelton Okot
About all the NodeManagers going down: on restarting them, I realized they were picking up a lot of containers from yarn-nm-recovery, so I got rid of that folder. Now my NodeManagers are no longer all down, but they are still running into continuous exits and I seem to have no way to debug this. I have allocated 2GB of heap space and I can see they do not need more than 1GB. The only thing I can see that could be a problem is the number of Java threads waiting: about 40-50, with another 50-60 threads running at a time.
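One way to get exact numbers for waiting versus running threads is to take a thread dump of the NodeManager process (for example with jstack and the process ID) and tally the java.lang.Thread.State lines. A minimal Python sketch, assuming the dump text has already been saved and read into a string; the function name count_thread_states is made up for illustration:

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: WAITING (parking)"
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)", re.MULTILINE)

def count_thread_states(dump_text):
    """Return a Counter of thread states found in a thread dump."""
    return Counter(STATE_RE.findall(dump_text))

# Tiny hand-written sample in the shape of a thread dump (not real output).
sample = (
    '"main" #1 prio=5\n'
    "   java.lang.Thread.State: RUNNABLE\n"
    '"worker" #2 prio=5\n'
    "   java.lang.Thread.State: WAITING (parking)\n"
)
print(count_thread_states(sample))
```

Comparing the counts from a dump taken shortly before an exit with one taken during normal operation may show whether the thread numbers actually change, or whether 40-50 waiting threads is simply the steady state.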
If you have only the DataNode and NodeManager on the worker nodes:
What's the NameNode heap size?
Can you upload the DataNode GC log so we can check whether garbage collection is configured well?
What are your DataNode JVM options? They may show room for improvement.
HDP provides a formula to calculate the heap size of a NameNode, but not of a DataNode.
NameNode heap size is 5GB. DataNode heap size is 2GB. JVM options for the DataNode: -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled. JVM options for the NameNode: -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled. @Geoffrey Shelton Okot
5GB corresponds to 50-70 million files; is that what you expect your NameNode to handle? NameNode heap size depends on many factors, such as the number of files, the number of blocks, and the load on the system.
For the NameNode, a good rule of thumb is 1GB of heap per 100TB of data in HDFS.
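To make these two rules of thumb concrete, here is a rough Python sketch. It uses only the figures quoted in this thread (about 1GB of heap per 100TB of data, and 5GB for roughly 50-70 million files); the function names are made up for illustration, the linear scaling is an assumption, and neither figure is an official sizing formula.

```python
def heap_gb_from_data(data_tb, tb_per_gb_heap=100):
    """Heap estimate from the 1GB-per-100TB rule of thumb quoted in this thread."""
    return data_tb / tb_per_gb_heap

def files_range_for_heap(heap_gb, files_per_5gb=(50_000_000, 70_000_000)):
    """Scale the '5GB ~ 50-70 million files' figure linearly (an assumption)."""
    low, high = files_per_5gb
    return (heap_gb * low / 5, heap_gb * high / 5)

print(heap_gb_from_data(500))    # 500TB of data -> 5.0
print(files_range_for_heap(5))   # (50000000.0, 70000000.0)
```

If the cluster holds far less data or far fewer files than these numbers suggest, the NameNode heap is probably not the limiting factor.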
HADOOP_HEAPSIZE sets the JVM heap size for all Hadoop components such as HDFS, YARN, and MapReduce. It is an integer passed to the JVM as the maximum memory (Xmx) argument.
What is the value of your HADOOP_HEAPSIZE?
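As a quick way to see what a given value translates to, here is a small Python sketch. It assumes the integer is interpreted as megabytes, as the Hadoop 2.x scripts do; the helper names xmx_flag and flag_from_environment are made up for illustration, and the daemon's real command line (for example from ps) remains the authoritative source.

```python
import os

def xmx_flag(heapsize):
    """Build the -Xmx flag for a HADOOP_HEAPSIZE value, assumed to be in MB."""
    if heapsize is None or not str(heapsize).strip():
        return None  # variable unset or empty
    return "-Xmx{}m".format(int(heapsize))

def flag_from_environment():
    """Look up HADOOP_HEAPSIZE in the current environment, if it is set."""
    return xmx_flag(os.environ.get("HADOOP_HEAPSIZE"))

print(xmx_flag("2048"))  # -Xmx2048m
```

Running flag_from_environment() in a shell that has sourced hadoop-env.sh should show whether the variable is set at all.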