Member since: 10-03-2020
Posts: 235
Kudos Received: 15
Solutions: 18
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 494 | 11-11-2024 09:31 AM |
| | 1348 | 08-28-2023 02:13 AM |
| | 1874 | 12-15-2021 05:26 PM |
| | 1720 | 10-22-2021 10:09 AM |
| | 4856 | 10-20-2021 08:44 AM |
11-14-2024
10:43 AM
Thanks for getting back. The process_timestamp isn't there, and it isn't available on the other running processes either. I had tried the workaround and it didn't work, but I will give it another go. Another thing: the soft link for the RegionServer process does not exist in the /var/run/cloudera-scm-agent/supervisor/include directory.
05-24-2024
02:34 AM
Hi @NaveenBlaze, you can get more information from https://github.com/c9n/hadoop/blob/master/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java#L196 . Notice these lines in the doTailEdits method:
FSImage image = namesystem.getFSImage();
streams = editLog.selectInputStreams(lastTxnId + 1, 0, null, false);
editsLoaded = image.loadEdits(streams, namesystem);
04-18-2024
04:41 AM
Introduction
Apache Hadoop's efficiency and reliability depend heavily on the performance of the Java Virtual Machine (JVM). To help monitor and analyze JVM pauses, especially those induced by Garbage Collection (GC), Hadoop includes a utility named JvmPauseMonitor. It is a Hadoop common class introduced by https://issues.apache.org/jira/browse/HADOOP-9618 and is widely used in Hadoop projects such as HDFS, HBase, YARN, Oozie, and MapReduce.
Overview of JvmPauseMonitor
Link to reference code here
Key Features and Implementation
private String formatMessage(long extraSleepTime,
    Map<String, GcTimes> gcTimesAfterSleep,
    Map<String, GcTimes> gcTimesBeforeSleep) {
  Set<String> gcBeanNames = Sets.intersection(
      gcTimesAfterSleep.keySet(),
      gcTimesBeforeSleep.keySet());
  List<String> gcDiffs = Lists.newArrayList();
  for (String name : gcBeanNames) {
    GcTimes diff = gcTimesAfterSleep.get(name).subtract(
        gcTimesBeforeSleep.get(name));
    if (diff.gcCount != 0) {
      gcDiffs.add("GC pool '" + name + "' had collection(s): " +
          diff.toString());
    }
  }
  String ret = "Detected pause in JVM or host machine (eg GC): " +
      "pause of approximately " + extraSleepTime + "ms\n";
  if (gcDiffs.isEmpty()) {
    ret += "No GCs detected";
  } else {
    ret += Joiner.on("\n").join(gcDiffs);
  }
  return ret;
}
Monitoring Cycle:
It runs a sleep-wake cycle: the monitor thread repeatedly calls Thread.sleep(SLEEP_INTERVAL_MS) and compares the intended sleep duration with the time that actually elapsed. Any excess is treated as a pause in the JVM or on the host.
SLEEP_INTERVAL_MS is defined as:
private static final long SLEEP_INTERVAL_MS = 500;
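The sleep-wake measurement can be sketched in a few lines of plain Java. This is only an illustration of the idea, not the Hadoop implementation: the class name PauseDetectorSketch and both method names are hypothetical, and clamping negative values to zero is a choice made here for readability.

```java
import java.util.concurrent.TimeUnit;

/** Minimal sketch of a sleep-wake pause detector; not the Hadoop class itself. */
class PauseDetectorSketch {
    // Same interval as the SLEEP_INTERVAL_MS constant quoted above.
    static final long SLEEP_INTERVAL_MS = 500;

    /** Time spent beyond the intended sleep, in milliseconds (clamped at zero). */
    static long extraSleepMs(long startNanos, long endNanos, long intendedSleepMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0, elapsedMs - intendedSleepMs);
    }

    /** Sleeps once and reports how much longer than intended the sleep took. */
    static long measureOnce() throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(SLEEP_INTERVAL_MS);
        long end = System.nanoTime();
        return extraSleepMs(start, end, SLEEP_INTERVAL_MS);
    }

    public static void main(String[] args) throws InterruptedException {
        // On a healthy host the extra sleep should stay close to zero.
        for (int i = 0; i < 3; i++) {
            System.out.println("extra sleep: " + measureOnce() + " ms");
        }
    }
}
```

If the thread wakes up 8361 ms after it went to sleep for 500 ms, the extra sleep is 7861 ms, which is the kind of value that appears in the "pause of approximately ...ms" log message.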
Evaluation of GC Metrics:
It assesses garbage collection metrics before and after the sleep interval to determine whether GC activities contributed to the detected pauses.
Compute Diff:
diff is computed by subtracting the GC metrics captured before the sleep (gcTimesBeforeSleep) from those after the sleep (gcTimesAfterSleep) for each garbage collector identified by name.
Determine GC type:
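If GC events occurred during the sleep interval, diff.gcCount is non-zero for the affected collector, and its activity is added to the log message as "GC pool '<name>' had collection(s): ...".
If no GC events occurred during the sleep interval, diff.gcCount is zero for every collector, and the log message ends with "No GCs detected".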
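The before/after comparison can be reproduced with the standard java.lang.management API. The sketch below is an assumption-laden illustration rather than Hadoop's code: GcDiffSketch and its methods are hypothetical names, a long[] pair stands in for Hadoop's GcTimes class, and the "count=... time=...ms" suffix is an illustrative format.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of snapshotting GC counters and diffing two snapshots. */
class GcDiffSketch {
    /** Maps collector name to {collection count, accumulated collection time in ms}. */
    static Map<String, long[]> snapshot() {
        Map<String, long[]> map = new HashMap<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            map.put(bean.getName(),
                    new long[] {bean.getCollectionCount(), bean.getCollectionTime()});
        }
        return map;
    }

    /** Describes every collector whose collection count changed between snapshots. */
    static List<String> describeDiffs(Map<String, long[]> before, Map<String, long[]> after) {
        List<String> diffs = new ArrayList<>();
        for (Map.Entry<String, long[]> entry : after.entrySet()) {
            long[] earlier = before.get(entry.getKey());
            if (earlier == null) {
                continue; // only compare collectors present in both snapshots
            }
            long count = entry.getValue()[0] - earlier[0];
            long timeMs = entry.getValue()[1] - earlier[1];
            if (count != 0) {
                diffs.add("GC pool '" + entry.getKey() + "' had collection(s): count="
                        + count + " time=" + timeMs + "ms");
            }
        }
        return diffs;
    }
}
```

Calling snapshot() before and after a sleep and passing both maps to describeDiffs yields an empty list when no collector ran, which corresponds to the "No GCs detected" case.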
Logging Thresholds:
JvmPauseMonitor uses configurable thresholds to categorize the severity of detected pauses: INFO_THRESHOLD_DEFAULT = 1000 ms for info-level logs and WARN_THRESHOLD_DEFAULT = 10000 ms for warn-level logs.
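How the two thresholds translate into a log level can be sketched as follows. The class PauseSeveritySketch, the Level enum, and the classify method are hypothetical, and the strict greater-than comparison is an assumption about how the thresholds are applied.

```java
/** Sketch of mapping a measured pause onto a log level. */
class PauseSeveritySketch {
    // Default values quoted in the text above, in milliseconds.
    static final long INFO_THRESHOLD_DEFAULT = 1000;
    static final long WARN_THRESHOLD_DEFAULT = 10000;

    enum Level { NONE, INFO, WARN }

    /** Returns the level for a pause; strict comparison is an assumption. */
    static Level classify(long extraSleepMs, long infoThresholdMs, long warnThresholdMs) {
        if (extraSleepMs > warnThresholdMs) {
            return Level.WARN;
        }
        if (extraSleepMs > infoThresholdMs) {
            return Level.INFO;
        }
        return Level.NONE; // below both thresholds: nothing is logged
    }

    public static void main(String[] args) {
        System.out.println(classify(7861, INFO_THRESHOLD_DEFAULT, WARN_THRESHOLD_DEFAULT));
    }
}
```

With the defaults, a pause of 7861 ms falls between the two thresholds and is reported at INFO level, while anything above 10000 ms would be reported at WARN level.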
Understanding "GC" and "No GC"
"GC" Detected: If the difference (diff) in GC metrics (count and time) before and after the pause indicates an increase, it signifies that GC events occurred during the pause. These events are detailed in the log message, specifying which GC pools were active and the extent of their activity.
if (diff.gcCount != 0) {
  gcDiffs.add("GC pool '" + name + "' had collection(s): " + diff.toString());
}
"No GC" Detected: Conversely, if no collector recorded any collection during the interval (i.e., gcDiffs is empty), the pause is logged without attributing it to GC activity. This suggests the pause was caused by something other than garbage collection, such as OS-level scheduling delays.
if (gcDiffs.isEmpty()) {
  ret += "No GCs detected";
}
GC logging/debug:
Add the parameters below to the JVM options. In Cloudera Manager (CM), they can usually be added to java_opts in the role's configuration.
Basic logging parameters for performance tuning (JDK 8 flag syntax):
-verbose:gc
-XX:+PrintGCCause
-XX:+PrintGCDetails
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCDateStamps
Additional parameters that may be used for deeper evaluation:
-XX:+PrintClassHistogramBeforeFullGC
-XX:+PrintClassHistogramAfterFullGC
-XX:+PrintReferenceGC
-XX:+PrintTenuringDistribution
You can either analyze the collected GC logs directly or use a tool such as GCEasy (https://gceasy.io/).
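To confirm that a running JVM actually picked up the logging flags, the process can inspect its own startup arguments through the standard RuntimeMXBean. This is a minimal sketch: GcFlagCheckSketch, WANTED, and missing are hypothetical names, and the flag list is just a subset of the parameters listed above.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch that reports which expected GC logging flags the JVM was not started with. */
class GcFlagCheckSketch {
    // A subset of the flags listed above, chosen for illustration.
    static final List<String> WANTED = Arrays.asList(
            "-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps");

    /** Returns the wanted flags that do not appear in the given JVM arguments. */
    static List<String> missing(List<String> jvmArgs, List<String> wanted) {
        List<String> result = new ArrayList<>();
        for (String flag : wanted) {
            if (!jvmArgs.contains(flag)) {
                result.add(flag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // getInputArguments() returns the options passed to the JVM at startup.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing GC logging flags: " + missing(jvmArgs, WANTED));
    }
}
```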
Possible Reasons
Possible reasons for "GC" pauses inside JVMs are:
Heap size and utilization problems
Improper GC parameters
Overloaded JVM threads
Known bugs or memory leaks
Many articles already explain JVM GC issues, so they are not covered in detail here.
Possible reasons for the "No GC" pause:
A "No GC" pause implies that garbage collection did not cause the pause. Several factors outside of GC can lead to such pauses:
Operating System Pauses
The application or the JVM process might be paused due to the operating system's actions, such as swapping memory to disk, other processes consuming excessive CPU resources, or OS-level maintenance tasks.
Use uptime, iostat, and sar to find the OS utilization pattern.
Check /var/log/messages and `dmesg -T` for warnings from the kernel.
Check the Cloudera Manager host charts, e.g. load average, disk, and network utilization.
Review the Diagnostic Bundle alerts and health checks from CSI.
Hardware Issues
Underlying hardware problems, such as failing disks, network issues, or memory errors, can lead to pauses in the JVM as it waits for IO operations or encounters hardware errors.
External System Calls
Calls to external systems or services, especially those that involve network communication or disk access, can introduce delays. If these calls block the main application threads, they can result in pauses not attributed to GC.
Example Case of a "No GC" pause that causes the NameNode to crash
Log snippets:
"No GC" pauses are detected
2024-02-25 21:31:27,854 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7861ms
No GCs detected
2024-02-25 21:31:35,704 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7350ms
No GCs detected
The NN gets no response to sendEdits through QuorumJournalManager
2024-02-25 21:31:36,703 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8849 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2024-02-25 21:31:37,703 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9850 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2024-02-25 21:31:38,704 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10851 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2024-02-25 21:31:39,706 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11852 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
NN is fenced
2024-02-25 21:31:43,517 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal
2024-02-25 21:32:13,947 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
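When a role log is long, the "No GC" pauses shown in excerpts like the ones above can be picked out programmatically. The following is a minimal sketch: the class PauseLogScanSketch and its method are hypothetical, and it assumes the "No GCs detected" text sits on the line directly after the pause message, as it does in these snippets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch that extracts pause durations not attributed to GC from log lines. */
class PauseLogScanSketch {
    // Matches the JvmPauseMonitor message and captures the pause length in ms.
    private static final Pattern PAUSE = Pattern.compile(
            "JvmPauseMonitor: Detected pause in JVM or host machine \\(eg GC\\): "
            + "pause of approximately (\\d+)ms");

    /** Returns the pause lengths (ms) whose following line is "No GCs detected". */
    static List<Long> noGcPausesMs(List<String> lines) {
        List<Long> pauses = new ArrayList<>();
        for (int i = 0; i + 1 < lines.size(); i++) {
            Matcher matcher = PAUSE.matcher(lines.get(i));
            if (matcher.find() && lines.get(i + 1).trim().equals("No GCs detected")) {
                pauses.add(Long.parseLong(matcher.group(1)));
            }
        }
        return pauses;
    }
}
```

Feeding the four JvmPauseMonitor lines from the snippet above into noGcPausesMs yields the two pause lengths, 7861 and 7350, which can then be lined up against host metrics for the same timestamps.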
Charts
The "No GC" pause was caused by a sudden burst of write I/O on the /root directory (the system disk), as observed in Grafana:
(Chart 1) Disk utilization of directories
(Chart 2) Disk write I/O capacity per second
(Chart 3) Disk write IOPS
(Chart 4) Disk I/O operation time in 1 second
We found a cron job that runs the "hdfs oiv" command to parse the fsimage into fsimage.csv. It writes the large CSV file to the /root directory, which is on the system disk. This caused the sudden I/O peak, which in turn caused the "No GC" pause.
Reference:
Garbage Collection Pauses in Namenode and Datanode
02-23-2024
11:31 AM
@Vishal3041 As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post. Thanks.
09-26-2023
11:32 PM
@ns2, I'm happy to see you resolved your issue. Could you please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?
08-28-2023
11:07 PM
Thanks @willx for the response. I had earlier regenerated the keytabs and principals multiple times. For some reason, only the ZooKeeper principals seem to have been locked and were not being generated. I discovered this while trying to remove the principals manually. Once I forcefully removed those principals and regenerated the keytabs and principals from CM, the issue was resolved.
07-14-2023
01:17 AM
Hi All, I have developed a Spring Boot application that reads data from HBase via Phoenix. The application creates a new Phoenix connection every time, and the resulting high memory usage crashes the app. Please suggest a solution.
04-20-2023
02:28 AM
Hi @willx, but we should have some option to create a Phoenix table with a bucket configuration other than prefix row key partitioning. Thanks, Jyothsna
04-10-2023
08:22 AM
Hi @rahuledavalath, when you migrated from HDP to CDP, were you able to ingest data into HBase through the Phoenix driver? Thanks, Jyothsna
08-02-2022
03:32 AM
Hello @syedshakir, please let us know which CDH version you are running.
Case A: If I understand correctly, you have a kerberized cluster and the file is on the local file system rather than on HDFS, so you don't need Kerberos authentication. Refer to the Google documentation below, which describes a few ways to do it: https://cloud.google.com/storage/docs/uploading-objects#upload-object-cli
Case B: To be honest, I have never done this, so I would try the following:
1. Follow the document below to configure Google Cloud Storage with Hadoop: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_gcs_config.html
2. If distcp does not work, follow this document to configure some properties: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_admin_distcp_secure_insecure.html
3. Save the whole distcp output and upload it here so I can help you check it. Remember to remove sensitive information (such as hostnames and IP addresses) from the logs before uploading.
If the distcp output doesn't contain Kerberos-related errors, enable debug logging, re-run the distcp job, and save the new output:
export HADOOP_ROOT_LOGGER=DEBUG,console; export HADOOP_OPTS="-Dsun.security.krb5.debug=true"
Thanks, Will