Created on 09-05-2023 03:33 PM - edited 09-05-2023 04:00 PM
High-level Problem Explanation
The primary garbage collection challenge often arises when the heap configuration of the Namenode or Datanode is inadequate. GC pauses can trigger Namenode crashes, failovers, and performance bottlenecks. In larger clusters, the Namenode may stop responding during the transition to another Namenode.
Detailed Technical Explanation
The Namenode comprises three layers of management:
All of these layers are stored in the Namenode in-memory heap, and any substantial modifications to them can lead to increased heap usage.
Similarly, the Datanode also maintains block mapping information in its memory.
High heap usage can be caused by several factors, including:
Inadequate heap size for the current dataset.
Lack of proper heap tuning configurations.
Not setting "-XX:CMSInitiatingOccupancyFraction", which leaves it at the default value of 92.
Gradual growth in the number of inodes/files over time.
Sudden spikes in HDFS data volumes.
Sudden changes in HDFS snapshot activity.
Job executions generating millions of small files.
System slowness, including CPU or IO bottlenecks.
Namenode failure and Namenode transitions
Namenode logs showing JVM metrics, with any pause exceeding 10 seconds being a concern.
WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 143694ms GC pool 'ParNew' had collection(s): count=1 time=0ms GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=143973ms
Occasionally, GC pauses are induced by system overhead factors, such as CPU spikes, IO contention, slow disk performance, or kernel delays. Look for the phrase "No GCs detected" in the Namenode log.
WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 83241ms No GCs detected
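The two log patterns above can be told apart mechanically. The following is a minimal sketch, assuming the log lines follow the exact format of the samples shown; the function name classify_pause and the returned field names are hypothetical, and the 10-second concern threshold comes from the guidance above.

```python
import re

# Patterns based on the JvmPauseMonitor sample lines shown above.
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")
GC_POOL_RE = re.compile(
    r"GC pool '([^']+)' had collection\(s\): count=(\d+) time=(\d+)ms"
)

def classify_pause(line, concern_ms=10_000):
    """Describe one JvmPauseMonitor line, or return None if it does not match."""
    match = PAUSE_RE.search(line)
    if match is None:
        return None
    pause_ms = int(match.group(1))
    # Map each reported GC pool to the time it spent collecting.
    gc_time_ms = {name: int(t) for name, _count, t in GC_POOL_RE.findall(line)}
    if "No GCs detected" in line or not gc_time_ms:
        cause = "host"  # CPU, IO, disk, or kernel delay rather than GC
    else:
        cause = "gc"
    return {
        "pause_ms": pause_ms,
        "cause": cause,
        "gc_time_ms": gc_time_ms,
        "concern": pause_ms > concern_ms,
    }
```

Running this over the Namenode log and counting results by cause gives a quick indication of whether to look at heap tuning or at the host itself.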
Do not misinterpret the FSNamesystem lock message; the root cause of the issue is actually the GC pause that occurred just before this event.
Number of suppressed read-lock reports: 0
Longest read-lock held at xxxx for 143973ms via java.lang.Thread.getStackTrace(Thread.java:1559)
Relevant Data to Check
To gain insights into usage patterns, it's advisable to regularly collect heap histograms. If you have enabled the parameters -XX:+PrintClassHistogramAfterFullGC and -XX:+PrintClassHistogramBeforeFullGC in the heap configuration, there's no need for additional histogram collection.
Cloudera Manager -> Charts -> Chart Builder -> SELECT jvm_gc_rate WHERE roleType = NAMENODE and hostname = "<problematic namenode hostname>"
SELECT jvm_max_memory_mb, jvm_heap_used_mb WHERE roleType = NAMENODE and hostname = "<problematic namenode hostname>"
Minor pauses are generally tolerable, but when pause times exceed the defined "ha.health-monitor.rpc-timeout.ms" timeout, the risk of a Namenode crash increases, particularly if the Namenode writes edits after a GC pause. Heap usage reaching 85% of the total heap can also lead to issues. Even short pauses of 2-5 seconds can pose problems when they occur frequently.
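The two thresholds above can be checked together. This is a minimal sketch, not a supported tool: the function name pause_risks is hypothetical, and the rpc_timeout_ms default of 45000 is an assumption, so confirm the value of ha.health-monitor.rpc-timeout.ms configured in your cluster. The heap values correspond to the jvm_heap_used_mb and jvm_max_memory_mb metrics.

```python
def pause_risks(pause_ms, heap_used_mb, heap_max_mb,
                rpc_timeout_ms=45_000, heap_warn_fraction=0.85):
    """Return a list of risk descriptions for one observation.

    rpc_timeout_ms is an assumed value; pass the one from your configuration.
    """
    risks = []
    if pause_ms > rpc_timeout_ms:
        risks.append("pause exceeds health-monitor RPC timeout")
    if heap_used_mb >= heap_warn_fraction * heap_max_mb:
        risks.append("heap usage at or above warning fraction")
    return risks
```

For example, pause_risks(143694, 46080, 51200) reports both risks, while a 2-second pause on a lightly used heap reports none. Frequent short pauses still need to be judged separately, for instance from the jvm_gc_rate chart.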
To adjust the heap size effectively, consider the current system load and anticipated future workloads. Given the diverse range of customer use cases, our standard guidelines may not always align perfectly with each scenario. Factors like differing file lengths, snapshot utilization, encryption zones, and more can influence requirements.
Instead, it's advisable to analyze the current number of files, directories, and blocks in conjunction with the existing heap utilization. For instance, if you have 30 million files and directories, 40 million blocks, and 200 snapshots, and your current heap usage is 45GB out of a total heap size of 50GB, you can estimate the additional heap required for a 50% load increase as current_heap * additional_load_percentage / 100. In this case, that is 45GB * 50 / 100 = 22.5GB, so you would need roughly 23GB of additional heap to accommodate the increased load.
Please note that you can find the current heap usage metrics in the Namenode web UI or through CM (Cloudera Manager) charts by querying: SELECT jvm_max_memory_mb, jvm_heap_used_mb WHERE roleType = NAMENODE and hostname = "<problematic Namenode hostname>"
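The heap estimate described above can be written as a small calculation. This is a rough sketch of the stated formula only; the function name additional_heap_gb is hypothetical, and the result is a starting point rather than a guarantee for any particular workload.

```python
import math

def additional_heap_gb(current_heap_used_gb, load_increase_pct):
    """Estimate extra heap as current_heap * additional_load_percentage / 100."""
    return current_heap_used_gb * load_increase_pct / 100

# Example from the text: 45GB in use, planning for a 50% load increase.
extra = additional_heap_gb(45, 50)
print(extra)             # 22.5
print(math.ceil(extra))  # 23, rounded up to whole gigabytes
```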
To ensure smooth operation, it's wise to allocate an additional 10-15% of heap beyond the total configured heap. If Sentry is configured, consider adding a 10-20% buffer.
It's crucial to maintain the total count of files and directories in the cluster below 350 million. Namenode performance tends to decline after surpassing this threshold. We generally recommend not exceeding 300 million files per cluster.
Please cross-reference our guidance with official heap documents provided by CDP.
If a customer has configured G1 Garbage Collection for the Namenode and there is no notable improvement in Namenode GC performance, consider using CMS. Note: during our internal testing, we observed no significant performance improvement in the Namenode when using G1 Garbage Collection, but we recommend validating this with your own testing.
For Datanode performance, ensure that the Java heap is set to 1GB per 1 million blocks. Additionally, verify that all Datanodes are balanced, with a roughly equal distribution of blocks. Avoid scenarios where one node holds a disproportionately high number of blocks compared to others.
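The Datanode guidance above can be sketched as two checks. The function names are hypothetical, and the 25% tolerance used to flag an overloaded node is an assumption chosen for illustration, not a documented threshold; block counts per host would come from your own monitoring.

```python
import math

def datanode_heap_gb(block_count):
    """Heap sized at 1GB per 1 million blocks, rounded up, minimum 1GB."""
    return max(1, math.ceil(block_count / 1_000_000))

def unbalanced_nodes(blocks_by_host, tolerance=0.25):
    """Return hosts holding more than (1 + tolerance) times the mean block count.

    tolerance is an illustrative assumption; adjust it to your own standard.
    """
    mean = sum(blocks_by_host.values()) / len(blocks_by_host)
    return sorted(
        host for host, blocks in blocks_by_host.items()
        if blocks > mean * (1 + tolerance)
    )
```

For example, a Datanode with 2.5 million blocks would be sized at 3GB, and in a cluster with block counts of 1.0, 1.1, and 2.4 million, only the third node would be flagged.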