I have been experiencing failures with my DataNodes; the errors are WRITE_BLOCK and READ_BLOCK. I have checked the data transfer handlers and dfs.datanode.max.transfer.threads is set to 16384. I run HDP 2.4.3 with 11 nodes. Please see the errors below:
2017-03-24 10:09:59,749 ERROR datanode.DataNode (DataXceiver.java:run(278)) - dn:50010:DataXceiver error processing READ_BLOCK operation src: /ip_address:49591 dst: /ip_address:50010
2017-03-24 11:02:18,750 ERROR datanode.DataNode (DataXceiver.java:run(278)) - dn:50010:DataXceiver error processing WRITE_BLOCK operation src: /ip_address:43052 dst: /ip_address:50010
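For reference, this is how I confirmed the transfer-thread limit the DataNode actually sees, and checked whether the xceiver limit was being hit. The log path is the usual HDP default and the grep string is an assumption; adjust both if your layout differs.

# Print the effective value of dfs.datanode.max.transfer.threads on this node
hdfs getconf -confKey dfs.datanode.max.transfer.threads
# Look for xceiver exhaustion messages in the DataNode log (path assumes HDP defaults)
grep -i "exceeds the limit of concurrent" /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log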
In the posted stack trace, there are a lot of GC pauses.
Below is a good article explaining NameNode Garbage Collection practices:
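If it helps, a quick way to count the GC-related pauses directly in the DataNode log is sketched below; the log path is the usual HDP default and is an assumption.

# Count JVM pause warnings logged by the DataNode's JvmPauseMonitor
grep -c "Detected pause in JVM or host machine" /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log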
For the write issue: if you share more information, it will be easier to understand what is going on.
1) Check that the DataNode is listed as live in the Ambari Web UI (a command-line alternative is sketched after this list).
2) If the DataNode is fine, it may be the JIRA referenced below.
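A command-line way to confirm the DataNode status, assuming you can run it as the hdfs user:

# List live/dead DataNodes as seen by the NameNode (run as the hdfs user)
sudo -u hdfs hdfs dfsadmin -report | grep -E "Live datanodes|Dead datanodes"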
Hi @Joshua Adeleke, how frequently do you see the errors? These are sometimes seen in busy clusters, and usually clients/HDFS recover from transient failures.
If there are no job or task failures around the time of the errors, I would just ignore them.
Edit: I took a look at your attached log file. There are a lot of GC pauses, as @Namit Maheshwari pointed out.
Try increasing the DataNode heap size and PermGen/NewGen allocations until the GC pauses go away.
2017-03-25 10:10:18,219 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(192)) - Detected pause in JVM or host machine (eg GC): pause of approximately 44122ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=44419ms
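In Ambari this is done under HDFS > Configs (DataNode maximum Java heap size); it ends up in hadoop-env.sh roughly like the sketch below. The 4g/1g sizes are only placeholders, not recommendations; size them to your nodes and workload.

# Illustrative DataNode JVM options in hadoop-env.sh (values are placeholders)
# On Java 7 you could also raise -XX:MaxPermSize; PermGen no longer exists on Java 8
export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g -XX:NewSize=1g -XX:MaxNewSize=1g -XX:+UseConcMarkSweepGC ${HADOOP_DATANODE_OPTS}"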
I'd also post this question on the Ambari track to check why Ambari didn't detect the DataNodes going down.
From your logs it is hard to say why the DataNode went down. I again recommend increasing the DataNode heap allocation via Ambari, and check that your nodes are provisioned with a sufficient amount of RAM.
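A quick sanity check on the nodes themselves (plain Linux commands, nothing HDP-specific):

# Show total/used/free memory in GB on the node
free -g
# Show the resident memory (KB) of the running DataNode JVM
ps -C java -o rss,args | grep -i datanode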