We have a cluster running HDP 2.4 with 8 worker nodes. Recently, two of our DataNodes have started going down frequently: both usually fail at least once a day, and often more than that. They can be restarted without any difficulty, but they usually fail again within 12 hours. There is nothing out of the ordinary in the logs except very long GC pauses shortly before each failure. For example, shortly before one of them failed this morning, I saw the following in its log:
2017-04-25 03:49:27,529 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(192)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23681ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=23769ms
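To check whether pauses like this one line up with every failure, a small script along the following lines could pull all of the JvmPauseMonitor warnings out of a DataNode log. This is only a rough sketch: the regular expressions are written against the three log lines quoted above, and the names `parse_pauses` and `long_pauses`, as well as the 10-second threshold, are my own choices rather than anything from Hadoop.

```python
import re

# Matches the JvmPauseMonitor warning in the format of the excerpt above.
PAUSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+WARN\s+util\.JvmPauseMonitor"
    r".*pause of approximately (?P<ms>\d+)ms"
)

# Matches the per-collector lines that follow a pause warning.
POOL_RE = re.compile(
    r"^GC pool '(?P<pool>[^']+)' had collection\(s\): "
    r"count=(?P<count>\d+) time=(?P<ms>\d+)ms"
)


def parse_pauses(lines):
    """Return one dict per detected pause, with any GC pool lines attached."""
    pauses = []
    for raw in lines:
        line = raw.strip()
        m = PAUSE_RE.match(line)
        if m:
            pauses.append(
                {"timestamp": m.group("ts"), "pause_ms": int(m.group("ms")), "pools": {}}
            )
            continue
        m = POOL_RE.match(line)
        if m and pauses:
            # Pool lines are assumed to belong to the most recent pause warning.
            pauses[-1]["pools"][m.group("pool")] = {
                "count": int(m.group("count")),
                "time_ms": int(m.group("ms")),
            }
    return pauses


def long_pauses(pauses, threshold_ms=10000):
    """Keep only pauses at or above threshold_ms (an arbitrary cutoff)."""
    return [p for p in pauses if p["pause_ms"] >= threshold_ms]
```

Passing an open log file to `parse_pauses` and the result to `long_pauses` would give the timestamps of the long pauses, which could then be compared with the times the DataNodes went down, on both the failing nodes and a healthy one.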
I checked the free memory on that node, and it was slightly higher than the free memory on a similar node that isn't shutting down. Since it is always the same two nodes that fail, I assume the problem has something to do with the nodes themselves. Does anybody have any advice for this problem?