We have a cluster running HDP 2.5 with 3 worker nodes. Recently, two of our datanodes have been going down frequently - usually both go down at least once a day, often more than that. While they can be restarted without any difficulty, they usually fail again within 12 hours. There is nothing out of the ordinary in the logs except very long GC pause times before failure. For example, shortly before this morning's failure, I saw the following in the logs:
I set the datanode heap size to 16 GB and the new generation to 8 GB.
The issue looks to be Java heap memory pressure on the datanodes (the JvmPauseMonitor messages point to exactly that), possibly due to the large volume of data being handled on your systems. Try increasing the Java heap size for the datanodes; hopefully that will resolve the issue.
I tried increasing the Java heap for the datanodes from 16 GB to 24 GB, but the issue is still the same.
Based on the following GC logging:
Detected pause in JVM or host machine (eg GC): pause of approximately 23681ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=23769ms
We see that the GC pause is comparatively very high (around 23 seconds). This can happen when GC is not triggered aggressively enough: the heap keeps growing over time until it reaches 90+% of the total DataNode heap, and only then does a (long) GC cycle get triggered.
In this case we can make GC happen more aggressively by adding the following options to the DataNode JVM settings ("HADOOP_DATANODE_OPTS").
Ambari UI --> HDFS --> Configs (tab) --> Advanced (child tab) --> "hadoop-env template" --> find all occurrences of "HADOOP_DATANODE_OPTS" (including both the if and else blocks) and add the following settings.
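For illustration, the edited "hadoop-env template" entries might look roughly like the sketch below. The exact surrounding script varies by HDP version, and the occupancy fraction of 70 here is only an example starting point, not a verified recommendation for your cluster:

```shell
# Append the CMS tuning flags to both HADOOP_DATANODE_OPTS assignments
# (the one in the "if" branch and the one in the "else" branch):
export HADOOP_DATANODE_OPTS="${HADOOP_DATANODE_OPTS} \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```

After saving the config in Ambari, the DataNodes need to be restarted for the new JVM options to take effect.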
-XX:CMSInitiatingOccupancyFraction : The Throughput Collector starts a GC cycle only when the heap is full, i.e., when there is not enough space available to store a newly allocated or promoted object. With the CMS Collector, it is not advisable to wait this long, because the application keeps running (and allocating objects) during the concurrent GC. Thus, in order to finish a GC cycle before the application runs out of memory, the CMS Collector needs to start a GC cycle much earlier than the Throughput Collector.
-XX:+UseCMSInitiatingOccupancyOnly : We can use the flag -XX:+UseCMSInitiatingOccupancyOnly to instruct the JVM not to base its decision on when to start a CMS cycle on runtime statistics. Instead, when this flag is enabled, the JVM uses the value of CMSInitiatingOccupancyFraction for every CMS cycle, not just the first one. However, keep in mind that in the majority of cases the JVM does a better job of making GC decisions than us humans. Therefore, we should use this flag only if we have good reason (i.e., measurements) as well as really good knowledge of the lifecycle of objects generated by the application.
Also, the recommendation is to set the young generation heap size (-XX:MaxNewSize) to roughly 1/8th of the total max heap. The young generation uses the parallel collectors, and most short-lived objects are collected there before being promoted to the old generation, so the old generation has enough space and threads to handle the promotions.
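As a worked example of that sizing rule (using the 24 GB heap mentioned above purely to illustrate the arithmetic, not as a verified recommendation): 24 GB / 8 = 3 GB for the young generation.

```shell
# Example sizing: 24 GB max heap -> young generation = 24 / 8 = 3 GB
export HADOOP_DATANODE_OPTS="${HADOOP_DATANODE_OPTS} -Xmx24g -XX:MaxNewSize=3g"
```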
Also, can you please check your filesystem to see if the DataNode process is crashing and generating "hs_err_pid" files? If for some reason the DataNode JVM process is crashing, then you should see a "hs_err_pid" file, and it can be helpful in understanding why the DataNode process is crashing. Your DataNode process may have an option enabled by default that tells the JVM where to generate the crash dump text file (just in case it is crashing).
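For reference, the HotSpot flag that controls where these crash dumps land is -XX:ErrorFile. A setting might look like the sketch below; the path used here is an assumption for illustration, so check your actual HADOOP_DATANODE_OPTS for the real value:

```shell
# Hypothetical example: direct JVM crash dumps to a log directory.
# %p in the file name is replaced with the PID of the crashing process.
export HADOOP_DATANODE_OPTS="${HADOOP_DATANODE_OPTS} -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log"
```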
I have already added the above (-XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly) to HADOOP_DATANODE_OPTS, in both the if and else blocks.
Can you please check the following:
1. Is the "hs_err_pid" file for the DataNode being generated?
2. Is anything strange observed in "/var/log/messages" around the time the DataNode went down?
3. Does your OS have SAR reporting enabled? This will help us find historical data on events at the operating system level, to see whether anything unusual happened (like a spike in memory usage/CPU/IO, etc.). http://www.thegeekstuff.com/2011/03/sar-examples/
4. Have you recently upgraded your OS (kernel patches... etc)?
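If SAR (the sysstat package) is enabled, the historical data for point 3 can be pulled roughly as sketched below. The file names and time window are assumptions for illustration: sysstat typically stores one saXX file per day of the month under /var/log/sa, so adjust the file name and the -s/-e window to the day and time the DataNode died.

```shell
# Memory usage history for a given day (sa15 = the 15th of the month)
sar -r -f /var/log/sa/sa15

# CPU utilization in a window around the DataNode failure
sar -u -f /var/log/sa/sa15 -s 08:00:00 -e 10:00:00

# Block I/O activity for the same window
sar -b -f /var/log/sa/sa15 -s 08:00:00 -e 10:00:00
```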