DataXceiver threads stuck with high CPU usage

Contributor

Hi everyone,

I'm running Hadoop 3.1.1 and have hit an issue: after the DataNode has been up for a few days, it eventually becomes unresponsive. Inspecting the threads of the DataNode process, I found many DataXceiver threads pinned at nearly 100% CPU. Has anyone run into this before, and are there any recommended solutions?


[root@dn-27 ~]# top -H -p 74042
top - 17:24:14 up 10 days, 4:05, 1 user, load average: 140.45, 114.30, 110.42
Threads: 792 total, 36 running, 756 sleeping, 0 stopped, 0 zombie
%Cpu(s): 54.7 us, 38.0 sy, 0.0 ni, 7.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 52732768+total, 22929336 free, 29385958+used, 21053875+buff/cache
KiB Swap: 0 total, 0 free, 0 used. 22594056+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
56353 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 103:13.94 DataXceiver for
72973 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 70:57.77 DataXceiver for
84061 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 60:03.79 DataXceiver for
11326 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 55:46.58 DataXceiver for
15519 hdfs 20 0 11.3g 8.0g 34484 R 99.9 1.6 31:54.12 DataXceiver for
65962 hdfs 20 0 11.3g 8.0g 34484 R 99.7 1.6 74:41.84 DataXceiver for
56313 hdfs 20 0 11.3g 8.0g 34484 R 99.3 1.6 103:09.39 DataXceiver for
11325 hdfs 20 0 11.3g 8.0g 34484 R 99.0 1.6 55:43.29 DataXceiver for
65919 hdfs 20 0 11.3g 8.0g 34484 R 98.7 1.6 74:40.23 DataXceiver for
20557 hdfs 20 0 11.3g 8.0g 34484 R 98.7 1.6 41:18.60 DataXceiver for
10529 hdfs 20 0 11.3g 8.0g 34484 R 98.3 1.6 150:28.54 DataXceiver for
42962 hdfs 20 0 11.3g 8.0g 34484 R 98.3 1.6 120:37.85 DataXceiver for
10488 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 150:26.11 DataXceiver for
11909 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 150:27.20 DataXceiver for
57550 hdfs 20 0 11.3g 8.0g 34484 R 98.0 1.6 142:06.13 DataXceiver for
10486 hdfs 20 0 11.3g 8.0g 34484 R 97.7 1.6 150:26.47 DataXceiver for
73028 hdfs 20 0 11.3g 8.0g 34484 R 97.7 1.6 60:37.69 DataXceiver for
11901 hdfs 20 0 11.3g 8.0g 34484 R 97.4 1.6 150:25.12 DataXceiver for
72941 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 70:55.71 DataXceiver for
10887 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 55:43.40 DataXceiver for
11360 hdfs 20 0 11.3g 8.0g 34484 R 97.0 1.6 55:43.28 DataXceiver for
10528 hdfs 20 0 11.3g 8.0g 34484 R 96.7 1.6 150:27.95 DataXceiver for
11902 hdfs 20 0 11.3g 8.0g 34484 R 96.4 1.6 150:24.02 DataXceiver for
20521 hdfs 20 0 11.3g 8.0g 34484 R 96.0 1.6 41:20.82 DataXceiver for
22369 hdfs 20 0 11.3g 8.0g 34484 R 95.4 1.6 146:25.16 DataXceiver for
10673 hdfs 20 0 11.3g 8.0g 34484 R 95.0 1.6 55:47.24 DataXceiver for
73198 hdfs 20 0 11.3g 8.0g 34484 R 94.7 1.6 60:36.41 DataXceiver for
24624 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 146:16.92 DataXceiver for
20524 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 41:21.80 DataXceiver for
15472 hdfs 20 0 11.3g 8.0g 34484 R 94.4 1.6 31:54.54 DataXceiver for
72974 hdfs 20 0 11.3g 8.0g 34484 R 93.0 1.6 70:59.92 DataXceiver for
42967 hdfs 20 0 11.3g 8.0g 34484 R 92.1 1.6 120:32.41 DataXceiver for
43053 hdfs 20 0 11.3g 8.0g 34484 R 89.7 1.6 118:03.47 DataXceiver for
49234 hdfs 20 0 11.3g 8.0g 34484 R 87.1 1.6 48:41.65 DataXceiver for
43055 hdfs 20 0 11.3g 8.0g 34484 R 85.8 1.6 117:03.03 DataXceiver for
49932 hdfs 20 0 11.3g 8.0g 34484 R 80.8 1.6 48:38.63 DataXceiver for
78139 hdfs 20 0 11.3g 8.0g 34484 S 1.0 1.6 0:37.71 org.apache.hado
80884 hdfs 20 0 11.3g 8.0g 34484 S 0.7 1.6 0:15.24 VolumeScannerTh
74120 hdfs 20 0 11.3g 8.0g 34484 S 0.3 1.6 0:09.30 jsvc


A portion of the jstack output is shown below:

"DataXceiver for client DFSClient_NONMAPREDUCE_-1324017693_1 at /172.18.0.27:34088 [Sending block BP-354740316-172.18.0.1-1707099547847:blk_2856827749_1783210107]" #278210 daemon prio=5 os_prio=0 tid=0x00007f54481a1000 nid=0x1757c runnable [0x00007f53df2f1000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000007aa9383d0> (a sun.nio.ch.Util$3)
- locked <0x00000007aa9383c0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000007aa938198> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x00000007af591cf8> (a java.io.BufferedInputStream)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:547)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:614)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
--
"DataXceiver for client DFSClient_NONMAPREDUCE_-892667432_1 at /172.18.0.17:57202 [Receiving block BP-354740316-172.18.0.1-1707099547847:blk_2856799086_1783181444]" #268862 daemon prio=5 os_prio=0 tid=0x00007f5448ec8000 nid=0x849a runnable [0x00007f53f3c7b000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000007aaaa2cc8> (a sun.nio.ch.Util$3)
- locked <0x00000007aaaa2cb8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000007aaaa2c70> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
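
For anyone cross-referencing the two outputs: the PID column in top -H is the native thread ID in decimal, and jstack reports the same ID in hex in the nid field, so each hot thread above can be matched to its stack. A minimal sketch, assuming the DataNode PID 74042 from above and an arbitrary dump path; run it as the same user the DataNode runs as:

jstack 74042 > /tmp/dn.jstack            # capture a fresh thread dump
printf 'nid=0x%x\n' 56353                # top TID 56353 -> nid=0xdc21
grep -A 15 'nid=0xdc21' /tmp/dn.jstack   # print that thread's stack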

2 Replies

Super Collaborator

Hi, @allen_chu 

Your jstack shows many DataXceiver threads sitting in epollWait, which means the DataNode is waiting on slow or stalled client/network I/O. Over time this exhausts the transceiver threads and makes the DataNode unresponsive. Check network health and try to identify whether particular clients (e.g., the 172.18.x.x addresses in the thread names) are holding connections open. Also review these settings in hdfs-site.xml: dfs.datanode.max.transfer.threads, dfs.datanode.socket.read.timeout, and dfs.datanode.socket.write.timeout, and make sure they impose sensible limits and timeouts. Raising the transfer-thread limit or tightening the timeouts often helps. Finally, keep an eye out for stuck jobs on the client side.
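
If it helps, here is a quick way to check the current settings and to see which clients hold the most DataXceiver connections. A minimal sketch: the property names and defaults are the ones shipped in hdfs-default.xml, /tmp/dn.jstack is the dump captured earlier in this thread, and 9866 is the default Hadoop 3 data-transfer port (adjust it to your setup):

# Effective values on the DataNode host:
hdfs getconf -confKey dfs.datanode.max.transfer.threads    # default 4096
hdfs getconf -confKey dfs.datanode.socket.write.timeout    # default 480000 (ms)
# DataXceiver threads per client IP, from the thread dump:
grep -o 'DataXceiver for client [^ ]* at /[0-9.]*' /tmp/dn.jstack \
  | awk -F/ '{print $NF}' | sort | uniq -c | sort -rn | head
# Live TCP connections on the data-transfer port, grouped by peer IP:
ss -tn 'sport = :9866' | awk 'NR>1 {split($5,a,":"); print a[1]}' \
  | sort | uniq -c | sort -rn | head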

Rising Star

@allen_chu A possible reason is that the DataNode is overwhelmed by too many concurrent requests, causing the network to stall. If the problem is specific to this node, check the network-level configuration on this host; if not, look at the overall cluster load and at any heavy write operations that clients are pushing to HDFS.
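
A quick way to tell whether dn-27 is an outlier is to count DataXceiver threads on every DataNode. A minimal sketch; the host list is hypothetical, and the pgrep pattern may need adjusting for a jsvc/secure DataNode like this one:

for h in dn-01 dn-02 dn-27; do   # hypothetical DataNode hosts
  n=$(ssh "$h" "sudo -u hdfs jstack \$(pgrep -f DataNode | head -1) | grep -c 'DataXceiver for'")
  echo "$h: $n DataXceiver threads"
done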