<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: DataXceiver threads stuck with high CPU usage in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/412132#M253262</link>
    <description>&lt;P&gt;Hi,&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/118899"&gt;@allen_chu&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Your jstack shows many DataXceiver threads stuck in epollWait, meaning the DataNode is waiting on slow or stalled client/network I/O. Over time, this exhausts threads and makes the DataNode unresponsive. Please check network health and identify if certain clients (e.g., 172.18.x.x) are holding connections open. Review these configs in hdfs-site.xml: dfs.datanode.max.transfer.threads, dfs.datanode.socket.read.timeout, and dfs.datanode.socket.write.timeout to ensure proper limits and timeouts. Increasing max threads or lowering timeouts often helps. Also monitor for stuck jobs on the client side.&lt;/P&gt;</description>
    <pubDate>Tue, 19 Aug 2025 08:48:52 GMT</pubDate>
    <dc:creator>RAGHUY</dc:creator>
    <dc:date>2025-08-19T08:48:52Z</dc:date>
    <item>
      <title>DataXceiver threads stuck with high CPU usage</title>
      <link>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/412112#M253250</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I'm using Hadoop 3.1.1 and have encountered an issue: after running the DataNode for a few days, it eventually becomes unresponsive. When inspecting the threads of the DataNode process, I found that many of them are stuck in DataXceiver. I'd like to ask if anyone has encountered this before and if there are any recommended solutions.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;[root@dn-27 ~]# top -H -p 74042&lt;BR /&gt;top - 17:24:14 up 10 days, 4:05, 1 user, load average: 140.45, 114.30, 110.42&lt;BR /&gt;Threads: 792 total, 36 running, 756 sleeping, 0 stopped, 0 zombie&lt;BR /&gt;%Cpu(s): 54.7 us, 38.0 sy, 0.0 ni, 7.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st&lt;BR /&gt;KiB Mem : 52732768+total, 22929336 free, 29385958+used, 21053875+buff/cache&lt;BR /&gt;KiB Swap: 0 total, 0 free, 0 used. 22594056+avail Mem&lt;/P&gt;&lt;P&gt;PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND&lt;BR /&gt;56353 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 103:13.94 DataXceiver for&lt;BR /&gt;72973 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 70:57.77 DataXceiver for&lt;BR /&gt;84061 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 60:03.79 DataXceiver for&lt;BR /&gt;11326 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 55:46.58 DataXceiver for&lt;BR /&gt;15519 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 31:54.12 DataXceiver for&lt;BR /&gt;65962 hdfs 20 0 11.3g 8.0g 34484 R 99.7 1.6 74:41.84 DataXceiver for&lt;BR /&gt;56313 hdfs 20 0 11.3g 8.0g 34484 R 99.3 1.6 103:09.39 DataXceiver for&lt;BR /&gt;11325 hdfs 20 0 11.3g 8.0g 34484 R 99.0 1.6 55:43.29 DataXceiver for&lt;BR /&gt;65919 hdfs 20 0 11.3g 8.0g 34484 R 98.7 1.6 74:40.23 DataXceiver for&lt;BR /&gt;20557 hdfs 20 0 11.3g 8.0g 34484 R 98.7 1.6 41:18.60 DataXceiver for&lt;BR /&gt;10529 hdfs 20 0 11.3g 8.0g 34484 R 98.3 1.6 150:28.54 DataXceiver for&lt;BR /&gt;42962 hdfs 20 0 11.3g 8.0g 34484 R 98.3 1.6 120:37.85 DataXceiver for&lt;BR /&gt;10488 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 150:26.11 DataXceiver for&lt;BR /&gt;11909 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 150:27.20 DataXceiver for&lt;BR /&gt;57550 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 142:06.13 DataXceiver for&lt;BR /&gt;10486 hdfs 20 0 11.3g 8.0g 34484 R 97.7 1.6 150:26.47 DataXceiver for&lt;BR /&gt;73028 hdfs 20 0 11.3g 8.0g 34484 R 97.7 1.6 60:37.69 DataXceiver for&lt;BR /&gt;11901 hdfs 20 0 11.3g 8.0g 34484 R 97.4 1.6 150:25.12 DataXceiver for&lt;BR /&gt;72941 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 70:55.71 DataXceiver for&lt;BR /&gt;10887 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 55:43.40 DataXceiver for&lt;BR /&gt;11360 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 55:43.28 DataXceiver for&lt;BR /&gt;10528 hdfs 20 0 11.3g 8.0g 34484 R 96.7 1.6 150:27.95 DataXceiver for&lt;BR /&gt;11902 hdfs 20 0 11.3g 8.0g 34484 R 96.4 1.6 150:24.02 DataXceiver for&lt;BR /&gt;20521 hdfs 20 0 11.3g 8.0g 34484 R 96.0 1.6 41:20.82 DataXceiver for&lt;BR /&gt;22369 hdfs 20 0 11.3g 8.0g 34484 R 95.4 1.6 146:25.16 DataXceiver for&lt;BR /&gt;10673 hdfs 20 0 11.3g 8.0g 34484 R 95.0 1.6 55:47.24 DataXceiver for&lt;BR /&gt;73198 hdfs 20 0 11.3g 8.0g 34484 R 94.7 1.6 60:36.41 DataXceiver for&lt;BR /&gt;24624 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 146:16.92 DataXceiver for&lt;BR /&gt;20524 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 41:21.80 DataXceiver for&lt;BR /&gt;15472 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 31:54.54 DataXceiver for&lt;BR /&gt;72974 hdfs 20 0 11.3g 8.0g 34484 R 93.0 1.6 70:59.92 DataXceiver for&lt;BR /&gt;42967 hdfs 20 0 11.3g 8.0g 34484 R 92.1 1.6 120:32.41 DataXceiver for&lt;BR /&gt;43053 hdfs 20 0 11.3g 8.0g 34484 R 
89.7 1.6 118:03.47 DataXceiver for&lt;BR /&gt;49234 hdfs 20 0 11.3g 8.0g 34484 R 87.1 1.6 48:41.65 DataXceiver for&lt;BR /&gt;43055 hdfs 20 0 11.3g 8.0g 34484 R 85.8 1.6 117:03.03 DataXceiver for&lt;BR /&gt;49932 hdfs 20 0 11.3g 8.0g 34484 R 80.8 1.6 48:38.63 DataXceiver for&lt;BR /&gt;78139 hdfs 20 0 11.3g 8.0g 34484 S 1.0 1.6 0:37.71 org.apache.hado&lt;BR /&gt;80884 hdfs 20 0 11.3g 8.0g 34484 S 0.7 1.6 0:15.24 VolumeScannerTh&lt;BR /&gt;74120 hdfs 20 0 11.3g 8.0g 34484 S 0.3 1.6 0:09.30 jsvc&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;The part of jstack is at the following :&lt;/P&gt;&lt;P&gt;"DataXceiver for client DFSClient_NONMAPREDUCE_-1324017693_1 at /172.18.0.27:34088 [Sending block BP-354740316-172.18.0.1-1707099547847:blk_2856827749_1783210107]" #278210 daemon prio=5 os_prio=0 tid=0x00007f54481a1000 nid=0x1757c runnable [0x00007f53df2f1000]&lt;BR /&gt;java.lang.Thread.State: RUNNABLE&lt;BR /&gt;at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)&lt;BR /&gt;at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)&lt;BR /&gt;at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)&lt;BR /&gt;at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)&lt;BR /&gt;- locked &amp;lt;0x00000007aa9383d0&amp;gt; (a sun.nio.ch.Util$3)&lt;BR /&gt;- locked &amp;lt;0x00000007aa9383c0&amp;gt; (a java.util.Collections$UnmodifiableSet)&lt;BR /&gt;- locked &amp;lt;0x00000007aa938198&amp;gt; (a sun.nio.ch.EPollSelectorImpl)&lt;BR /&gt;at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)&lt;BR /&gt;at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)&lt;BR /&gt;at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)&lt;BR /&gt;at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)&lt;BR /&gt;at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)&lt;BR /&gt;at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)&lt;BR /&gt;at java.io.BufferedInputStream.read(BufferedInputStream.java:265)&lt;BR /&gt;- locked &amp;lt;0x00000007af591cf8&amp;gt; (a java.io.BufferedInputStream)&lt;BR /&gt;at java.io.FilterInputStream.read(FilterInputStream.java:83)&lt;BR /&gt;at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:547)&lt;BR /&gt;at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:614)&lt;BR /&gt;at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)&lt;BR /&gt;--&lt;BR /&gt;"DataXceiver for client DFSClient_NONMAPREDUCE_-892667432_1 at /172.18.0.17:57202 [Receiving block BP-354740316-172.18.0.1-1707099547847:blk_2856799086_1783181444]" #268862 daemon prio=5 os_prio=0 tid=0x00007f5448ec8000 nid=0x849a runnable [0x00007f53f3c7b000]&lt;BR /&gt;java.lang.Thread.State: RUNNABLE&lt;BR /&gt;at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)&lt;BR /&gt;at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)&lt;BR /&gt;at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)&lt;BR /&gt;at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)&lt;BR /&gt;- locked &amp;lt;0x00000007aaaa2cc8&amp;gt; (a sun.nio.ch.Util$3)&lt;BR /&gt;- locked &amp;lt;0x00000007aaaa2cb8&amp;gt; (a java.util.Collections$UnmodifiableSet)&lt;BR /&gt;- locked &amp;lt;0x00000007aaaa2c70&amp;gt; (a sun.nio.ch.EPollSelectorImpl)&lt;BR /&gt;at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)&lt;BR /&gt;at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)&lt;BR /&gt;at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)&lt;/P&gt;</description>
      <pubDate>Thu, 14 Aug 2025 09:37:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/412112#M253250</guid>
      <dc:creator>allen_chu</dc:creator>
      <dc:date>2025-08-14T09:37:19Z</dc:date>
    </item>
    <item>
      <title>Re: DataXceiver threads stuck with high CPU usage</title>
      <link>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/412132#M253262</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/118899"&gt;@allen_chu&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Your jstack shows many DataXceiver threads stuck in epollWait, meaning the DataNode is waiting on slow or stalled client/network I/O. Over time, this exhausts threads and makes the DataNode unresponsive. Please check network health and identify if certain clients (e.g., 172.18.x.x) are holding connections open. Review these configs in hdfs-site.xml: dfs.datanode.max.transfer.threads, dfs.datanode.socket.read.timeout, and dfs.datanode.socket.write.timeout to ensure proper limits and timeouts. Increasing max threads or lowering timeouts often helps. Also monitor for stuck jobs on the client side.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Aug 2025 08:48:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/412132#M253262</guid>
      <dc:creator>RAGHUY</dc:creator>
      <dc:date>2025-08-19T08:48:52Z</dc:date>
    </item>
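    <!-- A minimal hdfs-site.xml sketch of the settings named in the reply above. The values are
         illustrative assumptions rather than figures from the thread, and the read-timeout key
         mentioned in the reply is omitted here; verify each property name and default against the
         hdfs-default.xml shipped with your Hadoop release before applying.
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value>
      <description>Upper bound on concurrent DataXceiver threads (default 4096); raise it for high-concurrency workloads.</description>
    </property>
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>300000</value>
      <description>Write-side socket timeout in milliseconds (5 minutes here); lowering it lets the DataNode reclaim threads from stalled clients sooner.</description>
    </property>
    -->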
    <item>
      <title>Re: DataXceiver threads stuck with high CPU usage</title>
      <link>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/412848#M253727</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/118899"&gt;@allen_chu&lt;/a&gt;&amp;nbsp;A possible reason is that the DataNode is being overwhelmed by concurrent requests, causing the network stall. If the problem is specific to this node, check the network-level configuration on this host; if not, look at the overall cluster load or any heavy write operations being pushed by clients to HDFS.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Nov 2025 10:41:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/412848#M253727</guid>
      <dc:creator>sathishkr</dc:creator>
      <dc:date>2025-11-13T10:41:49Z</dc:date>
    </item>
    <item>
      <title>Re: DataXceiver threads stuck with high CPU usage</title>
      <link>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/413312#M253997</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/118899"&gt;@allen_chu&lt;/a&gt;&amp;nbsp;FYI&lt;BR /&gt;&lt;BR /&gt;➤ This issue (high CPU usage, a large number of threads stuck in DataXceiver, and a high load average) is a classic symptom of TCP socket leakage or connection hanging within the HDFS Data Transfer Protocol.&lt;/P&gt;&lt;P&gt;➤ Based on your top output and jstack, here is the detailed breakdown of what is happening and how to resolve it.&lt;/P&gt;&lt;P&gt;➤ Analysis of the Symptoms&lt;BR /&gt;1. CPU Saturation (99% per thread): Your top output shows dozens of DataXceiver threads consuming nearly 100% CPU each. This usually indicates that the threads are in a "busy-wait" or spinning state within the NIO epollWait call.&lt;BR /&gt;&lt;BR /&gt;2. Stuck in epollWait: The jstack shows threads sitting in sun.nio.ch.EPollArrayWrapper.epollWait. While this is a normal state for a thread waiting for I/O, in your case, these threads are likely waiting for a packet from a client that has already disconnected or is "half-closed," but the DataNode hasn't timed out the connection.&lt;/P&gt;&lt;P&gt;3. Thread Exhaustion: With 792 threads already live and still climbing, your DataNode is steadily consuming its dfs.datanode.max.transfer.threads budget (default 4096, often further constrained by the OS ulimit). As these threads accumulate, the DataNode loses the ability to accept new I/O requests, becoming unresponsive.&lt;/P&gt;&lt;P&gt;➤ Recommended Solutions&lt;/P&gt;&lt;P&gt;1. Tune Socket Timeouts (Immediate Fix)&lt;BR /&gt;The most common cause is that the DataNode waits too long for a slow or dead client. You should tighten the transfer timeouts to force these "zombie" threads to close.&lt;BR /&gt;&lt;BR /&gt;=&amp;gt; Update your hdfs-site.xml:&lt;BR /&gt;&lt;BR /&gt;dfs.datanode.socket.write.timeout: The default is typically 8 minutes (480000 ms), and 0 disables the timeout entirely. Lowering it to 300000 (5 minutes) reclaims stalled writer threads sooner.&lt;BR /&gt;&lt;BR /&gt;dfs.datanode.socket.reuse.keepalive: This is a duration in milliseconds (default 4000), not a boolean; it controls how long an idle connection is kept open for reuse.&lt;BR /&gt;&lt;BR /&gt;dfs.datanode.transfer.socket.send.buffer.size &amp;amp; recv.buffer.size: Ensure these are set to 131072 (128KB) to optimize throughput and prevent stalls.&lt;/P&gt;&lt;P&gt;2. Increase the Max Receiver Threads&lt;BR /&gt;If your cluster handles high-concurrency workloads (like Spark or HBase), the default thread count might be too low.&lt;/P&gt;&lt;P&gt;&amp;lt;property&amp;gt;&lt;BR /&gt;&amp;lt;name&amp;gt;dfs.datanode.max.transfer.threads&amp;lt;/name&amp;gt;&lt;BR /&gt;&amp;lt;value&amp;gt;16384&amp;lt;/value&amp;gt;&lt;BR /&gt;&amp;lt;/property&amp;gt;&lt;/P&gt;&lt;P&gt;3. Check for Network "Half-Closed" Connections&lt;/P&gt;&lt;P&gt;Since the threads are stuck in read, it is possible the OS is keeping sockets in CLOSE_WAIT or FIN_WAIT2 states.&lt;BR /&gt;&lt;BR /&gt;a.] Check socket status: Run netstat -anp | grep 9866 | awk '{print $6}' | sort | uniq -c.&lt;BR /&gt;&lt;BR /&gt;b.] OS Tuning: Adjust the Linux kernel to more aggressively close dead connections. Add these to /etc/sysctl.conf:&lt;BR /&gt;net.ipv4.tcp_keepalive_time = 600&lt;BR /&gt;net.ipv4.tcp_keepalive_intvl = 60&lt;BR /&gt;net.ipv4.tcp_keepalive_probes = 20&lt;/P&gt;&lt;P&gt;4. 
Address HDFS-14569 (Software Bug)&lt;/P&gt;&lt;P&gt;Hadoop 3.1.1 is susceptible to a known issue where DataXceiver threads can leak during block moves or heavy balancer activity.&lt;BR /&gt;&lt;BR /&gt;Issue: DataXceiver fails to exit if a client stops sending data mid-packet but keeps the TCP connection open.&lt;BR /&gt;&lt;BR /&gt;Recommendation: If possible, upgrade to Hadoop 3.2.1+ or 3.3.x. These versions contain significantly improved NIO handling and better logic for terminating idle Xceivers.&lt;/P&gt;&lt;P&gt;➤ Diagnostic Step: Finding the "Bad" Clients&lt;BR /&gt;To identify which clients are causing this, run this command on the DataNode (netstat reports the owning process rather than thread names, so filter on the data transfer port, 9866 by default):&lt;BR /&gt;netstat -atnp | grep ':9866' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr&lt;BR /&gt;This will tell you which IP addresses are holding the most connections to the DataNode. If one specific IP (like a single Spark executor or a specific user's edge node) has hundreds of connections, that client's code is likely not closing DFSClient instances correctly.&lt;/P&gt;</description>
      <pubDate>Sat, 10 Jan 2026 05:52:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/DataXceiver-threads-stuck-with-high-CPU-usage/m-p/413312#M253997</guid>
      <dc:creator>9een</dc:creator>
      <dc:date>2026-01-10T05:52:28Z</dc:date>
    </item>
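    <!-- A minimal hdfs-site.xml sketch of the remaining socket settings discussed in step 1 of the
         reply above (dfs.datanode.max.transfer.threads and dfs.datanode.socket.write.timeout are
         already illustrated earlier in this feed). The values are illustrative assumptions, not
         recommendations from the thread; confirm property names and defaults in the
         hdfs-default.xml of your Hadoop release.
    <property>
      <name>dfs.datanode.socket.reuse.keepalive</name>
      <value>4000</value>
      <description>How long, in milliseconds, an idle connection is kept open for reuse (a duration, not a boolean).</description>
    </property>
    <property>
      <name>dfs.datanode.transfer.socket.send.buffer.size</name>
      <value>131072</value>
      <description>128 KB send buffer for block transfers; 0 lets the OS auto-tune.</description>
    </property>
    <property>
      <name>dfs.datanode.transfer.socket.recv.buffer.size</name>
      <value>131072</value>
      <description>128 KB receive buffer for block transfers; 0 lets the OS auto-tune.</description>
    </property>
    -->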
  </channel>
</rss>

