Created 08-14-2025 02:37 AM
Hi everyone,
I'm running Hadoop 3.1.1 and have hit an issue: after the DataNode has been up for a few days, it eventually becomes unresponsive. Inspecting the threads of the DataNode process, I found that many of them are stuck in DataXceiver, each consuming close to 100% CPU. Has anyone encountered this before, and are there any recommended solutions?
[root@dn-27 ~]# top -H -p 74042
top - 17:24:14 up 10 days, 4:05, 1 user, load average: 140.45, 114.30, 110.42
Threads: 792 total, 36 running, 756 sleeping, 0 stopped, 0 zombie
%Cpu(s): 54.7 us, 38.0 sy, 0.0 ni, 7.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 52732768+total, 22929336 free, 29385958+used, 21053875+buff/cache
KiB Swap: 0 total, 0 free, 0 used. 22594056+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
56353 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 103:13.94 DataXceiver for
72973 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 70:57.77 DataXceiver for
84061 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 60:03.79 DataXceiver for
11326 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 55:46.58 DataXceiver for
15519 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 31:54.12 DataXceiver for
65962 hdfs 20 0 11.3g 8.0g 34484 R 99.7 1.6 74:41.84 DataXceiver for
56313 hdfs 20 0 11.3g 8.0g 34484 R 99.3 1.6 103:09.39 DataXceiver for
11325 hdfs 20 0 11.3g 8.0g 34484 R 99.0 1.6 55:43.29 DataXceiver for
65919 hdfs 20 0 11.3g 8.0g 34484 R 98.7 1.6 74:40.23 DataXceiver for
20557 hdfs 20 0 11.3g 8.0g 34484 R 98.7 1.6 41:18.60 DataXceiver for
10529 hdfs 20 0 11.3g 8.0g 34484 R 98.3 1.6 150:28.54 DataXceiver for
42962 hdfs 20 0 11.3g 8.0g 34484 R 98.3 1.6 120:37.85 DataXceiver for
10488 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 150:26.11 DataXceiver for
11909 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 150:27.20 DataXceiver for
57550 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 142:06.13 DataXceiver for
10486 hdfs 20 0 11.3g 8.0g 34484 R 97.7 1.6 150:26.47 DataXceiver for
73028 hdfs 20 0 11.3g 8.0g 34484 R 97.7 1.6 60:37.69 DataXceiver for
11901 hdfs 20 0 11.3g 8.0g 34484 R 97.4 1.6 150:25.12 DataXceiver for
72941 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 70:55.71 DataXceiver for
10887 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 55:43.40 DataXceiver for
11360 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 55:43.28 DataXceiver for
10528 hdfs 20 0 11.3g 8.0g 34484 R 96.7 1.6 150:27.95 DataXceiver for
11902 hdfs 20 0 11.3g 8.0g 34484 R 96.4 1.6 150:24.02 DataXceiver for
20521 hdfs 20 0 11.3g 8.0g 34484 R 96.0 1.6 41:20.82 DataXceiver for
22369 hdfs 20 0 11.3g 8.0g 34484 R 95.4 1.6 146:25.16 DataXceiver for
10673 hdfs 20 0 11.3g 8.0g 34484 R 95.0 1.6 55:47.24 DataXceiver for
73198 hdfs 20 0 11.3g 8.0g 34484 R 94.7 1.6 60:36.41 DataXceiver for
24624 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 146:16.92 DataXceiver for
20524 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 41:21.80 DataXceiver for
15472 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 31:54.54 DataXceiver for
72974 hdfs 20 0 11.3g 8.0g 34484 R 93.0 1.6 70:59.92 DataXceiver for
42967 hdfs 20 0 11.3g 8.0g 34484 R 92.1 1.6 120:32.41 DataXceiver for
43053 hdfs 20 0 11.3g 8.0g 34484 R 89.7 1.6 118:03.47 DataXceiver for
49234 hdfs 20 0 11.3g 8.0g 34484 R 87.1 1.6 48:41.65 DataXceiver for
43055 hdfs 20 0 11.3g 8.0g 34484 R 85.8 1.6 117:03.03 DataXceiver for
49932 hdfs 20 0 11.3g 8.0g 34484 R 80.8 1.6 48:38.63 DataXceiver for
78139 hdfs 20 0 11.3g 8.0g 34484 S 1.0 1.6 0:37.71 org.apache.hado
80884 hdfs 20 0 11.3g 8.0g 34484 S 0.7 1.6 0:15.24 VolumeScannerTh
74120 hdfs 20 0 11.3g 8.0g 34484 S 0.3 1.6 0:09.30 jsvc
Part of the jstack output is as follows:
"DataXceiver for client DFSClient_NONMAPREDUCE_-1324017693_1 at /172.18.0.27:34088 [Sending block BP-354740316-172.18.0.1-1707099547847:blk_2856827749_1783210107]" #278210 daemon prio=5 os_prio=0 tid=0x00007f54481a1000 nid=0x1757c runnable [0x00007f53df2f1000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000007aa9383d0> (a sun.nio.ch.Util$3)
- locked <0x00000007aa9383c0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000007aa938198> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x00000007af591cf8> (a java.io.BufferedInputStream)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:547)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:614)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
--
"DataXceiver for client DFSClient_NONMAPREDUCE_-892667432_1 at /172.18.0.17:57202 [Receiving block BP-354740316-172.18.0.1-1707099547847:blk_2856799086_1783181444]" #268862 daemon prio=5 os_prio=0 tid=0x00007f5448ec8000 nid=0x849a runnable [0x00007f53f3c7b000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000007aaaa2cc8> (a sun.nio.ch.Util$3)
- locked <0x00000007aaaa2cb8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000007aaaa2c70> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
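For anyone tracing the same symptom: the thread IDs shown by top -H are decimal, while jstack prints the same IDs as hexadecimal nid values, so a quick conversion lets you match a hot thread to its stack. A small sketch using the DataNode PID (74042) and one hot thread (56353) from the output above:

```shell
# Convert a hot thread ID from `top -H -p 74042` (decimal) to the hex
# form that jstack prints as nid=0x...
printf 'nid=0x%x\n' 56353   # prints: nid=0xdc21

# Then pull that thread's stack out of a fresh thread dump, e.g.:
# jstack 74042 | grep -A 20 'nid=0xdc21'
```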
Created 08-19-2025 01:48 AM
Hi, @allen_chu
Your jstack shows many DataXceiver threads sitting in epollWait, yet top reports them RUNNABLE at close to 100% CPU, which suggests the selector loops are spinning on slow or stalled client/network connections rather than sleeping idly. Over time this exhausts the DataNode's transfer threads and makes it unresponsive. Please check network health and identify whether particular clients (e.g., 172.18.x.x) are holding connections open. Also review these settings in hdfs-site.xml: dfs.datanode.max.transfer.threads, dfs.datanode.socket.read.timeout, and dfs.datanode.socket.write.timeout, to ensure the limits and timeouts are sensible. Increasing the thread limit or lowering the timeouts often helps. Finally, monitor for stuck jobs on the client side.
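For reference, a minimal hdfs-site.xml sketch of two of those settings; the values here are illustrative only (defaults noted in comments), and the right numbers depend on your cluster and workload:

```xml
<!-- hdfs-site.xml on each DataNode; restart DataNodes after changing -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value> <!-- default is 4096; raises the DataXceiver thread cap -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>480000</value> <!-- milliseconds; lower it to shed stalled writers sooner -->
</property>
```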