DATA_NODE_WEB_METRIC_COLLECTION has become bad

New Contributor

Dear all,

 

Version: Cloudera Express 5.14.2


1 master node


7 workers

 

Problem:
"The health test result for DATA_NODE_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server."

When the above alert pops up, records like the following appear in the DataNode logs:

ERROR DataNode 
BlockSender.sendChunks() exception: 
java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:605)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:789)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:736)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:551)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:748)


cloudera-scm-agent logs:

2158 Monitor-GenericMonitor throttling_logger ERROR (8 skipped) Error fetching metrics at 'http://datanode02.hadoop:1006/jmx'
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.14.2-py2.7.egg/cmf/monitor/generic/metric_collectors.py", line 203, in _collect_and_parse_and_return
simplejson.load(opened_url))
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/simplejson-2.1.2-py2.7-linux-x86_64.egg/simplejson/__init__.py", line 324, in load
return loads(fp.read(),
File "/usr/lib64/python2.7/socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "/usr/lib64/python2.7/httplib.py", line 602, in read
s = self.fp.read(amt)
File "/usr/lib64/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
timeout: timed out


The alerts are raised for specific DataNodes only, not for all of them.

 

What can be the problem here?

 

Thanks in advance

Panjl

7 REPLIES

Contributor

It seems the DataNode web server is not responding to the CM agent. This can happen for various reasons. From my experience, I would suggest checking the DataNode logs for "JVM pauses" or "Slow BlockReceiver" messages, as those are the primary causes of slow responses from the DN to the CM agent. The DataNode logs you provided give some hint of a slow block receiver, but to be sure, filter the DataNode logs for the keywords mentioned above.
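For the keyword filtering, something along these lines might help. It is only a rough sketch: the function names are made up for illustration, and the message fragments in SLOW_PATTERNS are my assumption of how JVM pause and slow BlockReceiver warnings are worded, so confirm them against your own DataNode log before relying on the result.

```python
import re

# Assumed message fragments; confirm them against your own DataNode log.
SLOW_PATTERNS = re.compile(
    r"JvmPauseMonitor|Detected pause in JVM|Slow BlockReceiver",
    re.IGNORECASE,
)


def find_slow_events(lines):
    """Return (line_number, text) for every line matching SLOW_PATTERNS."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SLOW_PATTERNS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file; errors="replace" tolerates stray non-UTF-8 bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_slow_events(handle)
```

Call scan_file() with the path to the DataNode log on the affected host (the location depends on your installation) and compare the timestamps of the hits with the time of the alert.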

 

What's the frequency of this alert in CM? Is it a one-time alert or is it persistent?

 

You can check your DataNode heap utilization with a chart. Here's how: CM -> Chart Builder -> select jvm_max_memory_mb, jvm_heap_used_mb where entityName="<DN-instance>"

 

 

 

New Contributor

It's a one-time alert; I'm getting it now.

Expert Contributor

@PDDF_VIGNESH 

 

See if you can get a successful response from the agent host to the URL reported in the agent log:

http://datanode02.hadoop:1006/jmx

A few checks:

1. curl http://datanode02.hadoop:1006/jmx

2. telnet datanode02.hadoop 1006

 

If this is successful, restart the agent.

If there is an issue with the response, review the DN logs for the issues and workarounds suggested previously.
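If you prefer to script the two checks above, here is a minimal sketch. The function name probe_jmx is made up for illustration, the host and port come from the agent log in this thread, and the top-level "beans" key is what I would expect in the /jmx JSON output, so treat that as an assumption.

```python
import json
import socket
import urllib.request


def probe_jmx(host, port, timeout=5.0):
    """Return a one-line status for http://host:port/jmx."""
    # Step 1: bare TCP connect, the same thing the telnet check shows.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        return "tcp connect failed: %s" % exc

    # Step 2: HTTP GET, the same thing the curl check shows.
    # An empty ProxyHandler ignores any http_proxy environment setting.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = "http://%s:%d/jmx" % (host, port)
    try:
        with opener.open(url, timeout=timeout) as resp:
            body = resp.read()
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return "http request failed: %s" % exc

    # Step 3: the agent parses the body as JSON, so do the same here.
    try:
        beans = json.loads(body).get("beans", [])
    except (ValueError, AttributeError) as exc:
        return "response is not the expected JSON: %s" % exc
    return "ok: %d beans" % len(beans)
```

Running print(probe_jmx("datanode02.hadoop", 1006)) from the agent host tells you which stage fails. A failed TCP connect points at the port or the network, while a timeout on the HTTP read matches the "timed out" traceback in the agent log and points at a DataNode that is slow to answer.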

 

Hope this helps,
Paras
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

 

New Contributor

Hi,

I'm not able to get a response from the curl and telnet commands.

It says "failed to connect".

The issue frequency is low, but I'd like to know the root cause.

Expert Contributor

@PDDF_VIGNESH 

I hope you connected to the port configured for your cluster. Does the URL return a successful response in a browser? The responses only need to be checked while the issue is occurring.

 

The JMX output is generated by the DataNode. Not getting a response means there is an issue either with the DataNode or with the network communication.
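Since the issue is intermittent, one way to catch it while it is happening is to poll the URL and keep timestamps. A rough sketch, with made-up function names and placeholder values for the number of attempts and the interval:

```python
import time
import urllib.request


def fetch_seconds(url, timeout=10.0):
    """Fetch url once and return how long the full response took, in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start


def watch(probe, attempts, interval, sleep=time.sleep):
    """Call probe() repeatedly; return a list of (timestamp, seconds, error)."""
    results = []
    for i in range(attempts):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            results.append((stamp, probe(), None))
        except OSError as exc:  # covers URLError, timeouts, refused connections
            results.append((stamp, None, str(exc)))
        if i + 1 < attempts:
            sleep(interval)
    return results
```

For example, watch(lambda: fetch_seconds("http://datanode02.hadoop:1006/jmx"), attempts=60, interval=30) polls for about half an hour. The timestamps of the failed or slow attempts can then be lined up with the DataNode log for the same period.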

 

 

 

Community Manager

@PDDF_VIGNESH, did @paras's response help you resolve this issue?



Regards,

Vidya Sargur,
Community Manager



Community Manager

@PDDF_VIGNESH, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.



Regards,

Vidya Sargur,
Community Manager

