DATA_NODE_WEB_METRIC_COLLECTION has become bad

New Contributor

Dear all,

 

Version: Cloudera Express 5.14.2


1 master node


7 workers

 

Problem:
"The health test result for DATA_NODE_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server."

When the above alert pops up, records like the following appear in the DataNode logs:

ERROR DataNode 
BlockSender.sendChunks() exception: 
java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:605)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:789)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:736)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:551)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:748)


cloudera-scm-agent logs:

2158 Monitor-GenericMonitor throttling_logger ERROR (8 skipped) Error fetching metrics at 'http://datanode02.hadoop:1006/jmx'
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.14.2-py2.7.egg/cmf/monitor/generic/metric_collectors.py", line 203, in _collect_and_parse_and_return
simplejson.load(opened_url))
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/simplejson-2.1.2-py2.7-linux-x86_64.egg/simplejson/__init__.py", line 324, in load
return loads(fp.read(),
File "/usr/lib64/python2.7/socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "/usr/lib64/python2.7/httplib.py", line 602, in read
s = self.fp.read(amt)
File "/usr/lib64/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
timeout: timed out


The alerts are raised only by specific DataNodes, not by all of them.

 

What can be the problem here?

 

Thanks in advance

Panjl

1 REPLY

Re: DATA_NODE_WEB_METRIC_COLLECTION has become bad

Cloudera Employee

It seems the DataNode web server is not responding to the CM agent. This can happen for various reasons. From my experience, I would suggest checking the DataNode logs for "JVM pause" or "Slow BlockReceiver" messages, as those are the primary reasons for slow responses from the DN to the CM agent. The DataNode logs you provided give some hint of a slow block receiver, but to be sure, filter the DataNode logs for the keywords mentioned above.
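
To make the keyword check above a bit easier, here is a minimal Python sketch that counts matching lines per hour in a DataNode log. It assumes the log lines start with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp and that JVM pauses are logged with the phrase "Detected pause in JVM or host machine"; the exact wording can differ between Hadoop versions, so adjust the patterns to what you actually see. The function name count_slow_events and the script name in the usage comment are just illustrative.

```python
import re
import sys
from collections import Counter

# Phrases to search for. These are assumptions about the log wording;
# adjust them if your DataNode version logs these events differently.
PATTERNS = {
    "jvm_pause": re.compile(r"Detected pause in JVM or host machine"),
    "slow_block_receiver": re.compile(r"Slow BlockReceiver"),
}


def count_slow_events(lines):
    """Count matching log lines, grouped by (pattern name, hour).

    The hour bucket is simply the first 13 characters of the line,
    e.g. "2018-06-01 12", assuming a "YYYY-MM-DD HH:MM:SS,mmm" prefix.
    """
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(name, line[:13])] += 1
    return counts


# Usage (hypothetical script name): python scan_dn_log.py /path/to/datanode.log
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], errors="replace") as fh:
        for (name, hour), n in sorted(count_slow_events(fh).items()):
            print(f"{hour}  {name}: {n}")
```

If the busy hours line up with the times the CM alert fires, that points towards GC pauses or slow disk/network on those specific DataNodes.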

 

What's the frequency of this alert in CM? Is it a one-time alert or is it persistent?

 

You can check your DataNode heap utilization with a chart. In CM, open the Chart Builder and run: select jvm_max_memory_mb, jvm_heap_used_mb where entityName="<DN-instance>"
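
If you want to cross-check outside CM, the same /jmx endpoint that the agent polls can be queried by hand. The sketch below is only an illustration under a few assumptions: that the endpoint returns the usual {"beans": [...]} JSON with a "java.lang:type=Memory" bean, and that it is reachable without authentication (on a Kerberized cluster it may require SPNEGO, which this sketch does not handle). The helper names heap_usage_percent and probe are hypothetical.

```python
import json
import time
import urllib.request


def heap_usage_percent(jmx_doc):
    """Return heap used as a percentage of max heap, or None if unavailable.

    Expects the parsed JSON from a /jmx endpoint, i.e. a dict with a
    "beans" list containing the standard JVM Memory MXBean.
    """
    for bean in jmx_doc.get("beans", []):
        if bean.get("name") == "java.lang:type=Memory":
            heap = bean.get("HeapMemoryUsage", {})
            used, maximum = heap.get("used"), heap.get("max")
            # The JVM reports max as -1 when it is undefined.
            if used is not None and maximum and maximum > 0:
                return 100.0 * used / maximum
    return None


def probe(url, timeout=10.0):
    """Fetch the JMX JSON once; return (seconds taken, heap usage percent)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        doc = json.load(resp)
    return time.monotonic() - start, heap_usage_percent(doc)


# Example call, using the URL from the agent log in the question:
#   elapsed, heap_pct = probe("http://datanode02.hadoop:1006/jmx")
```

A response time close to the timeout, or heap usage that stays near 100%, on the affected DataNodes but not on the healthy ones would support the slow-web-server explanation.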