Member since 09-03-2020 | Posts: 126 | Kudos Received: 7 | Solutions: 0
01-09-2026
09:51 PM
@allen_chu FYI

This issue, characterized by high CPU usage, a large number of threads stuck in DataXceiver, and a high load average, is a classic symptom of TCP socket leakage or hung connections in the HDFS Data Transfer Protocol. Based on your top output and jstack, here is a breakdown of what is happening and how to resolve it.

➤ Analysis of the Symptoms

1. CPU saturation (99% per thread): Your top output shows dozens of DataXceiver threads each consuming nearly 100% CPU. This usually indicates the threads are in a busy-wait or spinning state within the NIO epollWait call.

2. Stuck in epollWait: The jstack shows threads sitting in sun.nio.ch.EPollArrayWrapper.epollWait. While this is a normal state for a thread waiting on I/O, in your case these threads are likely waiting for a packet from a client that has already disconnected or is "half-closed", and the DataNode has not timed out the connection.

3. Thread exhaustion: With 792 threads, your DataNode is approaching its dfs.datanode.max.transfer.threads limit (default 4096, and often further constrained by the OS ulimit). As these threads accumulate, the DataNode loses the ability to accept new I/O requests and becomes unresponsive.

➤ Recommended Solutions

1. Tighten socket timeouts (immediate fix)

The most common cause is that the DataNode waits too long for a slow or dead client. Tighten the transfer timeouts in hdfs-site.xml so these "zombie" threads are forced to close:

- dfs.datanode.socket.write.timeout: the default is 480000 ms (8 minutes), and a value of 0 disables the timeout entirely. Lowering it to 300000 (5 minutes) drops stalled writers sooner.
- dfs.datanode.socket.reuse.keepalive: this is a keep-alive window in milliseconds (default 4000), not a boolean; it controls how long an idle connection is held open for reuse.
- dfs.datanode.transfer.socket.send.buffer.size and dfs.datanode.transfer.socket.recv.buffer.size: set these to 131072 (128 KB) to optimize throughput and prevent stalls.

2. Increase the max receiver threads

If your cluster handles high-concurrency workloads (such as Spark or HBase), the default thread count may be too low:

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>16384</value>
</property>

3. Check for "half-closed" network connections

Since the threads are stuck in read, the OS may be keeping sockets in CLOSE_WAIT or FIN_WAIT2 states.

a. Check socket status: netstat -anp | grep 9866 | awk '{print $6}' | sort | uniq -c
b. OS tuning: make the Linux kernel close dead connections more aggressively by adding the following to /etc/sysctl.conf:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20

4. Address HDFS-14569 (software bug)

Hadoop 3.1.1 is susceptible to a known issue where DataXceiver threads can leak during block moves or heavy Balancer activity: a DataXceiver fails to exit if a client stops sending data mid-packet but keeps the TCP connection open. If possible, upgrade to Hadoop 3.2.1+ or 3.3.x, which contain significantly improved NIO handling and better logic for terminating idle Xceivers.

➤ Diagnostic Step: Finding the "Bad" Clients

To identify which clients are causing this, run the following on the DataNode. (Note that netstat reports the process name, e.g. java, rather than thread names, so filter on the data transfer port instead of grepping for "DataXceiver".)

netstat -atnp | grep :9866 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr

This shows which IP addresses hold the most connections to the DataNode. If one specific IP (such as a single Spark executor or a particular user's edge node) has hundreds of connections, that client's code is likely not closing DFSClient instances correctly.
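As a quick way to watch the thread count against that ceiling, here is a minimal sketch (assumes you run it as the DataNode's own user with jps/jstack on the PATH; the PID lookup and grep pattern are assumptions, not something from your logs):

# Count live DataXceiver threads in the DataNode JVM and show the configured ceiling.
DN_PID=$(jps | awk '/DataNode/ {print $1}')
echo "DataXceiver threads: $(jstack "$DN_PID" | grep -c DataXceiver)"
hdfs getconf -confKey dfs.datanode.max.transfer.threads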
01-09-2026
07:51 AM
Thanks for the suggestion, I will go with distcp because we have hundreds of thousands of files and "only" several thousand of them must be restored.
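For reference, a minimal distcp sketch for that kind of selective restore; the list file, cluster address, and target path below are placeholders, not paths from this thread:

# Hypothetical example: restore only the files named (one URI per line) in a list file.
hdfs dfs -put restore-list.txt /tmp/restore-list.txt
hadoop distcp -f /tmp/restore-list.txt hdfs://namenode:8020/restore/target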
11-13-2025
03:14 AM
Hello, please try using the hdfs mover command. Refer: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Mover_-_A_New_Data_Migration_Tool
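A minimal sketch of how it is typically used, assuming /data/cold is a placeholder path and COLD is the policy you want:

# Set the desired storage policy on the path, then run the mover to migrate its blocks.
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
hdfs mover -p /data/cold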
07-02-2025
01:34 PM
Hi @rizalt, from your report you probably have snapshots enabled on this directory, so deletes in it will not actually free space unless the snapshots referencing that data are also deleted. Keep in mind that deleting a snapshot makes it impossible to recover the data later if needed. On the NameNode web UI, check the "Snapshot" tab to see which snapshots exist.
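You can check the same thing from the command line; a sketch, where /data/mydir and snapshot_name are placeholders for your directory and snapshot:

# List snapshottable directories, then the snapshots under the directory in question.
hdfs lsSnapshottableDir
hdfs dfs -ls /data/mydir/.snapshot
# Only if you are certain the data is no longer needed:
hdfs dfs -deleteSnapshot /data/mydir snapshot_name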
12-17-2024
12:41 PM
1 Kudo
@JSSSS The key part of the error is: "java.io.IOException: File /user/JS/input/DIC.txt._COPYING_ could only be written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running and 3 node(s) are excluded in this operation."

According to the log, all 3 DataNodes were excluded (excludeNodes=[192.168.1.81:9866, 192.168.1.125:9866, 192.168.1.8>). The write only needs to reach the 1 minReplication node, but with every DataNode excluded there is nowhere to place the block, so the write fails. HDFS cannot use these nodes, possibly due to:

- Disk space issues.
- Write errors or disk failures.
- Network connectivity problems between the NameNode and DataNodes.

1. Verify that the DataNodes are live and connected to the NameNode:
hdfs dfsadmin -report
Look at the "Live datanodes" and "Dead datanodes" sections. If all 3 DataNodes are excluded, they may show up as dead or decommissioned.

2. Ensure the DataNodes have sufficient disk space for the write operation:
df -h
Look at the HDFS data directories (e.g. /hadoop/hdfs/data). If disk space is full, clear unnecessary files or increase disk capacity:
hdfs dfs -rm -r /path/to/old/unused/files

3. View the list of excluded nodes:
cat $HADOOP_HOME/etc/hadoop/datanodes.exclude
If nodes are wrongly excluded, remove their entries from datanodes.exclude and refresh the NameNode to apply the change:
hdfs dfsadmin -refreshNodes

4. Block placement policy: if the cluster restricts placement (e.g. rack awareness), verify the block placement policy:
grep dfs.block.replicator.classname $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Default: org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault

Happy hadooping
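A quick way to run the first two checks together; a sketch, where the data directory path is a placeholder for your dfs.datanode.data.dir:

# Summarize DataNode liveness and remaining capacity as seen by the NameNode.
hdfs dfsadmin -report | grep -E 'Live datanodes|Dead datanodes|DFS Remaining'
# On each DataNode, check local disk usage for the data directory.
df -h /hadoop/hdfs/data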
12-16-2024
02:10 PM
1 Kudo
@divyank Have you resolved this issue? If not: what you're encountering is common when Kerberos is enabled for HDFS, because it introduces authentication requirements that must be properly configured. Here's how to diagnose and resolve the problem.

1. Root Cause Analysis
When Kerberos is enabled:
- Authentication: every interaction with HDFS now requires a Kerberos ticket.
- Misconfiguration: the HDFS service or client-side configuration may not be aligned with the Kerberos requirements.
- Keytabs: keytab files for the HDFS service or for users accessing the service may be missing or misconfigured.
- Browser access: the HDFS Web UI may not support unauthenticated access unless explicitly configured.

2. Steps to Resolve

Step 1: Verify the Kerberos configuration
Check the Kerberos principal and keytab file paths for HDFS in Cloudera Manager (HDFS Service > Configuration):
- hadoop.security.authentication should be set to kerberos.
- dfs.namenode.kerberos.principal should match the principal defined in the KDC.
- dfs.namenode.keytab.file should point to a file that exists on the NameNode with correct permissions.

Step 2: Validate the Kerberos ticket
Check that the HDFS service has a valid keytab:
klist -kte /path/to/hdfs.keytab
If needed, obtain a ticket:
kinit -kt /path/to/hdfs.keytab hdfs/<hostname>@<REALM>
Test HDFS access from the command line:
hdfs dfs -ls /
If you get authentication errors, the Kerberos ticket might be invalid.

Step 3: Validate HDFS Web UI access
After enabling Kerberos, accessing the HDFS Web UI (e.g. http://namenode-host:50070) often requires authentication. By default unauthenticated access may be blocked, so ensure your browser is configured for Kerberos authentication or the UI is set to allow unauthenticated users. To enable unauthenticated access in Cloudera Manager (if needed), go to HDFS Service > Configuration, search for hadoop.http.authentication.type and set it to simple.

Step 4: Review logs for errors
Check the NameNode logs for Kerberos-related errors:
less /var/log/hadoop/hdfs/hadoop-hdfs-namenode.log
Look for errors like "GSSException: No valid credentials provided" or "Principal not found in the keytab".

Step 5: Synchronize clocks
Kerberos is sensitive to time discrepancies. Ensure all nodes in the cluster have synchronized clocks:
ntpdate <NTP-server>

Step 6: Restart services
Restart the affected HDFS services (NameNode, DataNodes) via Cloudera Manager after making changes, then check the status of HDFS:
hdfs dfsadmin -report

3. Confirm Resolution
Verify HDFS functionality from the CLI:
hdfs dfs -ls /
Then access the Web UI to confirm it works: http://<namenode-host>:50070
If HDFS works via the CLI but not in the Web UI, revisit the Web UI settings in Cloudera Manager to allow browser access or configure browser Kerberos support.

4. Troubleshooting Tips
If the issue persists, check the Kerberos ticket validity with klist, and test basic operations:
hdfs dfs -mkdir /test
hdfs dfs -put <local-file> /test

Let me know how it goes or if further guidance is needed!
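If you want to test the Kerberized web endpoint from a shell rather than a browser, here is a sketch; it assumes WebHDFS is enabled, curl was built with GSS/SPNEGO support, and the keytab path, principal, host, port, and realm shown are placeholders:

# Obtain a ticket, then let curl negotiate SPNEGO against the NameNode web endpoint.
kinit -kt /path/to/hdfs.keytab hdfs/namenode-host@EXAMPLE.COM
curl --negotiate -u : "http://namenode-host:50070/webhdfs/v1/?op=LISTSTATUS"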
12-12-2024
10:08 AM
1 Kudo
@irshan When you add the Balancer as a role in the HDFS cluster, it will indeed show as not started; that is expected. As for your main query: when you run the balancer, the utilization spread may be within the default threshold of 10 percent, in which case it will not move any blocks. Reduce the balancing threshold and try again (see the sketch below).
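For example, to move blocks once the utilization spread exceeds 5 percent instead of the default 10:

# Run the balancer with a tighter threshold (percentage of cluster utilization).
hdfs balancer -threshold 5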
12-12-2024
10:02 AM
1 Kudo
@Remme Though the procedure you followed may have helped you, on a larger cluster with TBs of data it is not a viable option. In that case I would advise working with Cloudera Support.
12-12-2024
09:48 AM
1 Kudo
@cc_yang It is possible that an HDFS space quota has been set on the directory and the directory has reached its hard limit, which would cause the file upload to fail with an insufficient-space message. You can read more about HDFS quotas here: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html
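A quick way to confirm; a sketch, where the path and quota value are placeholders:

# Show the name/space quotas on the directory and how much of each remains.
hdfs dfs -count -q -h /path/to/directory
# If appropriate, an administrator can raise or clear the space quota:
hdfs dfsadmin -setSpaceQuota 10t /path/to/directory
hdfs dfsadmin -clrSpaceQuota /path/to/directory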
12-12-2024
09:40 AM
Though one can intervene manually to fix under-replicated blocks, HDFS has matured a lot and the NameNode will take care of fixing under-replicated blocks on its own. The drawback of doing it manually is that it adds extra load to NameNode operations and may degrade the performance of existing jobs. So if you do plan to do it manually, do it outside business hours or over the weekend.
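If you do want to check or nudge it manually, here is a sketch; the file path is a placeholder:

# Count under-replicated blocks as reported by fsck.
hdfs fsck / | grep -i 'Under-replicated'
# Re-setting the replication factor on specific files also triggers re-replication.
hdfs dfs -setrep -w 3 /path/to/file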