About willx

willx · ‎03-04-2025

You may want to collect yarn application log to understand what happened after 3 hours, for example, it may be a yarn resource issue or stuck containers. 1. Open console debug log and re-run distcp and save the output export HADOOP_ROOT_LOGGER=DEBUG,console nohup hadoop distcp -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.buffer.size=1048576 -Dfs.s3a.multipart.size=10485760 -Dfs.s3a.multipart.threshold=10485760 -Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7360m -m=300 -bandwidth 400 -update [hdfs path] [s3a path] > distcp_console.out 2>&1 & 2. Collect yarn application logs: yarn logs -applicationId [applicationID] > /tmp/distcp_application.out 3. If there are stuck yarn containers, collect jstack of the container pid, refer to below post https://my.cloudera.com/knowledge/How-to-collect-thread-dumps-for-stuck-YARN-containers-via-jstack?id=73696

willx · ‎11-12-2024

Does the below process folder and the file inside of it exist? The ERROR is file not found. [12/Nov/2024 06:03:16 +0000] 1559 __run_queue process ERROR Error creating marker /var/run/cloudera-scm-agent/process/1546503323-hbase-REGIONSERVER/process_timestamp Traceback (most recent call last): File "/opt/cloudera/cm-agent/lib/python3.8/site-packages/cmf/process.py", line 1302, in mark_orphan f = open(marker, 'w') FileNotFoundError: [Errno 2] No such file or directory: '/var/run/cloudera-scm-agent/process/1546503323-hbase-REGIONSERVER/process_timestamp' Try to restart cloudera-scm-agent service and then restart RegionServer from CM. If it still doesn't work could you please try the workarounds again?

willx · ‎11-11-2024

Hi @sayebogbon , Could you please try to remove the config files from "/var/run/cloudera-scm-agent/supervisor/include". 1. Rename process dir from "/var/run/cloudera-scm-agent/process" 2. Delete orphan process dir soft link from "/var/run/cloudera-scm-agent/supervisor/include" 3. Kill the running services process kill -9 pid 4. Restart CM agent and Stop the services from CM server. 5. Start the services from CM again. A new process dir and pid shall be created by agent.

willx · ‎05-13-2024

Hi @NaveenBlaze, Thanks for raising the question. Please refer to the below article for the flow of checkpointing in HDFS. https://blog.cloudera.com/a-guide-to-checkpointing-in-hadoop/ Regards, Will Xiao, Cloudera support

willx · ‎04-18-2024

Introduction Apache Hadoop's efficiency and reliability significantly depend on the Java Virtual Machine (JVM)'s performance. To assist in monitoring and analyzing JVM pauses, especially those induced by Garbage Collection (GC), Hadoop integrates a utility named JvmPauseMonitor, a Hadoop common class introduced by https://issues.apache.org/jira/browse/HADOOP-9618 and widely used in Hadoop projects like HDFS, Hbase, Yarn, Oozie, MapReduce, etc. Overview of JvmPauseMonitor Link to reference code here Key Features and Implementation private String formatMessage(long extraSleepTime, Map<String, GcTimes> gcTimesAfterSleep, Map<String, GcTimes> gcTimesBeforeSleep) { Set<String> gcBeanNames = Sets.intersection( gcTimesAfterSleep.keySet(), gcTimesBeforeSleep.keySet()); List<String> gcDiffs = Lists.newArrayList(); for (String name : gcBeanNames) { GcTimes diff = gcTimesAfterSleep.get(name).subtract( gcTimesBeforeSleep.get(name)); if (diff.gcCount != 0) { gcDiffs.add("GC pool '" + name + "' had collection(s): " + diff.toString()); } } String ret = "Detected pause in JVM or host machine (eg GC): " + "pause of approximately " + extraSleepTime + "ms\n"; if (gcDiffs.isEmpty()) { ret += "No GCs detected"; } else { ret += Joiner.on("\n").join(gcDiffs); } return ret; } Monitoring Cycle: It executes a sleep-wake cycle, periodically sleeping through the Thread.sleep(SLEEP_INTERVAL_MS) call, comparing the intended sleep duration against the actual elapsed time to identify JVM pauses. SLEEP_INTERVAL_MS is defined as: private static final long SLEEP_INTERVAL_MS = 500; Evaluation of GC Metrics: It assesses garbage collection metrics before and after the sleep interval to determine whether GC activities contributed to the detected pauses. Compute Diff: diff is computed by subtracting the GC metrics captured before the sleep (gcTimesBeforeSleep) from those after the sleep (gcTimesAfterSleep) for each garbage collector identified by name. Determine GC type: If GC events happened, the diff.gcCount variable will be positive, then logged as "GC pause ". If no GC events are happening during the sleep interval, the diff.gcCount variable will be zero, then it is "No GC" Logging Thresholds: JvmPauseMonitor uses configurable thresholds (INFO_THRESHOLD_DEFAULT = 1000 for info-level logs and WARN_THRESHOLD_DEFAULT = 10000 for warn-level logs) to categorize the severity of detected pauses. Understanding "GC" and "No GC" "GC" Detected: If the difference (diff) in GC metrics (count and time) before and after the pause indicates an increase, it signifies that GC events occurred during the pause. These events are detailed in the log message, specifying which GC pools were active and the extent of their activity. if (diff.gcCount != 0) { gcDiffs.add("GC pool '" + name + "' had collection(s): " + diff.toString()); } "No GC" Detected: Conversely, if no significant differences in GC metrics are found (i.e., gcDiffs is empty), the pause is logged without attributing it to GC activities. This scenario suggests the pause was caused by factors other than garbage collection, such as OS-level scheduling delays. if (gcDiffs.isEmpty()) { ret += "No GCs detected"; } GC logging/debug: Add the below parameters to JVM params, which usually can be added to java_opts in the Role's Configuration in CM. Basic logging parameters for performance tuning: -verbose:gc -XX:-PrintGCCause -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps Additional parameters that may be used for more deep evaluation: -XX:+PrintClassHistogramBeforeFullGC -XX:+PrintClassHistogramAfterFullGC -XX:+PrintReferenceGC -XX:+PrintTenuringDistribution Either you can analyze the collected GC logs directly or use some tools like GCEasy(https://gceasy.io/) Possible Reasons Possible reasons for "GC" pauses inside JVMs are Heap Size and Utilization problem Improper GC parameters JVM threads are overloaded Known bugs or memory leaks Many articles explain JVM GC issues, so we discuss it less here. Possible reasons for the "No GC" pause: It implies that garbage collection activities didn't cause the pause. Several factors outside of GC can lead to such pauses Operating System Pauses The application or the JVM process might be paused due to the operating system's actions, such as swapping memory to disk, other processes consuming excessive CPU resources, or OS-level maintenance tasks. Using uptime , iostat , sar to find the OS utilization pattern. Check the /var/log/message and `dmesg -T` for warnings from kernel Cloudera Manager host charts, e.g. load average, Disk, and Network utilization. Review the Diagnostic Bundle Alerts and health checks from CSI Hardware Issues Underlying hardware problems, such as failing disks, network issues, or memory errors, can lead to pauses in the JVM as it waits for IO operations or encounters hardware errors. External System Calls Calls to external systems or services, especially those that involve network communication or disk access, can introduce delays. If these calls block the main application threads, they can result in pauses not attributed to GC. Example Case of a "No GC" pause that causes Namenode to crash Log snippets: "No GC pause happened" 2024-02-25 21:31:27,854 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7861ms No GCs detected 2024-02-25 21:31:35,704 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7350ms No GCs detected NN is not responsive to sendEdits through QuorumjournalManager 2024-02-25 21:31:36,703 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8849 ms (timeout=20000 ms) for a response for sendEdits. No responses yet. 2024-02-25 21:31:37,703 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9850 ms (timeout=20000 ms) for a response for sendEdits. No responses yet. 2024-02-25 21:31:38,704 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10851 ms (timeout=20000 ms) for a response for sendEdits. No responses yet. 2024-02-25 21:31:39,706 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11852 ms (timeout=20000 ms) for a response for sendEdits. No responses yet. NN is fenced 2024-02-25 21:31:43,517 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal 2024-02-25 21:32:13,947 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: Charts The "No GC" is due to a sudden write I/O on /root directory (sys disk) observed from Grafana: (Chart 1) Disk utilization of directories (Chart 2) Disk write I/O capacity per second (Chart 3) Disk write IOPS (Chart 4) Disk I/O operation time in 1 second We found there's a cron job that runs "hdfs oiv" command to parse fsimage into fsimage.csv but it is generating the big csv file at /root directory which is the system disk causing the sudden I/O peak further causing the "No GC" Reference: Garbage Collection Pauses in Namenode and Datanode

willx · ‎09-14-2023

Some steps to help narrow down the issue, if possible please attach outputs or the answer to the below items: 1. In order to know if the issue is from CM or HDFS, can you please check if the Standby NN's process is existing or not by running: ps -ef|grep -i namenode 2. Please check do we have any ERROR/WARN in the latest Standby NN, Are there any GC pause issues detected in the Standby NN's log? Attaching the errors may help us know the issue better. 3. Please check the status of cloudera-scm-agent by running the below commands, make sure the agent is Active: systemctl status cloudera-scm-agent 4. How about the other services on this host, they are all good but only SNN has this issue? 5. Please try to open NN webUI and SNN webUI from the browser, if SNN is up and running, the webUI should be good: The default webUI port is 9870 http://NN_ip:9870/dfshealth.html#tab-overview 6. Please check if cpu utilization and memory utilization are sufficient in this SNN host. 7. When and how did this issue happen, did it happen after the restarting?

willx · ‎08-28-2023

Hi @Srinivas-M , Questions: - What are the current encryption types? - What is the JDK version? - Are other services running well? hdfs/hbase/yarn etc. You can try following steps: - Did you try to kinit a keytab from the latest process directory of zookeeper (/var/run/cloudera-scm-agent/process/<latest_process_folder_of_zookeeper>/zookeeper.keytab) - Try to re-generate keytab and principals via CM and restart the zookeeper. A similar issue was in this KB. https://my.cloudera.com/knowledge/ERROR-quot-java-io-IOException-Could-not-configure-server?id=273635

willx · ‎04-20-2023

@Sindhu6 Please refer to this phoenix doc for java example and url syntax: https://phoenix.apache.org/faq.html#What_is_the_Phoenix_JDBC_URL_syntax An example URL is jdbc:phoenix:thin:url=http://hostname:8765;serialization=PROTOBUF;authentication=SPNEGO; principal=hbase/hostname@EXAMPLE; keytab=/var/run/cloudera-scm-agent/process/xx-hbase-REGIONSERVER/hbase.keytab (due to system will hide long text, devide into multiple lines, but you need write url into single line.) Jar is /opt/cloudera/parcels/CDH-7.1.x/jars/phoenix-queryserver-client-xxx.jar Driver Class is "org.apache.phoenix.queryserver.client.Driver" You should replace keytab, principle, jar with your own and start to test single java class instead of complex projects.

willx · ‎04-19-2023

Please check if your principal and keytab are set correctly. Here is another example: thick client: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/phoenix_using.html thin client: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/phoenix_thin_client_pqs.html

willx · ‎04-18-2023

Hi @Sindhu6, Please make sure phoenix and hbase are functional by accessing phoenix-sqlline and creating a test table then select data from it prior to using jdbc. Then please refer to the below doc for phoenix jdbc usage in cdp: https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/phoenix-access-data/topics/phoenix-orchestrating-sql.html

Online	Offline
Last Visited	‎10-22-2025 08:38 PM

Member Since	‎10-03-2020 06:12 AM
Last Visited	‎10-22-2025 08:38 PM
Posts	236
Kudos received	14

Cloudera Community

Re: Datanode and Impala Daemon Instances Show Unkn...

Re: Services not starting up after Enabling Kerber...

Re: What is the difference between volumes and fol...

Re: Hbase labels table creation

Re: All Hdfs file names older than N days

Re: Migrating HDFS data in S3

Re: Datanode and Impala Daemon Instances Show Unkn...

Re: Datanode and Impala Daemon Instances Show Unkn...

Re: HDFS cluster in HA enabled, during check point...

Comprehensive understanding of "No GC" pauses in h...

Re: As of HDFS CM5.12.1, the standby namenode is b...

Re: Services not starting up after Enabling Kerber...

Re: Unable to connect to CDP phoenix from Spring B...

Re: Unable to connect to CDP phoenix from Spring B...

Re: Unable to connect to CDP phoenix from Spring B...