We have a cluster running HDP 2.4 with 8 worker nodes. Recently, two of our DataNodes have been going down frequently: both usually go down at least once a day, and often more than that. They start up again without any difficulty, but usually fail again within 12 hours. There is nothing out of the ordinary in the logs except very long GC pauses before the failure. For example, shortly before one failed this morning, I saw the following in the logs:
2017-04-25 03:49:27,529 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(192)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23681ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=23769ms
I checked the free memory on that node, and it was slightly more than the free memory on a similar node that isn't shutting down. Since it is the same two nodes repeatedly, I assume the problem is something to do with the nodes themselves. Does anybody have any advice for this problem?
Hi @Mark Heydenrych, it is likely that your DataNodes are not configured with sufficient Java heap. Even though there is free RAM on the machine, the Java runtime will not use memory beyond its configured maximum heap size, which is specified via the -Xmx command-line option. You may be seeing this on only a few DataNodes because they wound up with more blocks.
You can change this setting through the HADOOP_DATANODE_OPTS environment variable, under Advanced hadoop-env.sh in Ambari.
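As a rough sketch only, the relevant line in hadoop-env.sh might look something like the fragment below. The 2g value is a placeholder I made up for illustration, not a figure from your cluster, and your existing HADOOP_DATANODE_OPTS line may already contain an -Xmx entry (possibly filled in from an Ambari heap-size field), in which case you would edit that value rather than add a second one.

```shell
# Hypothetical hadoop-env.sh fragment; "2g" is an assumed placeholder value.
# Keeps whatever options are already set and adds a maximum heap size.
# If the existing value already carries an -Xmx flag, change that flag
# in place instead of adding another one.
export HADOOP_DATANODE_OPTS="${HADOOP_DATANODE_OPTS} -Xmx2g"
```

After saving the change and restarting the DataNode, the new -Xmx value should show up on the DataNode's java command line, which is a quick way to confirm it took effect.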
I recommend starting by doubling the heap allocation for the DataNode and also adding the following options if not present already: