Support Questions

Num · ‎03-08-2019

Dear all,

Version: Cloudera Express 5.14.2

1 master node

7 workers

Problem:
"The health test result for DATA_NODE_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server."

When the above alert pops up, records like the following appear in the DataNode logs:

ERROR DataNode 
BlockSender.sendChunks() exception: 
java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:605)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:789)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:736)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:551)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:748)

cloudera-scm-agent logs:

2158 Monitor-GenericMonitor throttling_logger ERROR (8 skipped) Error fetching metrics at 'http://datanode02.hadoop:1006/jmx'
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.14.2-py2.7.egg/cmf/monitor/generic/metric_collectors.py", line 203, in _collect_and_parse_and_return
simplejson.load(opened_url))
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/simplejson-2.1.2-py2.7-linux-x86_64.egg/simplejson/__init__.py", line 324, in load
return loads(fp.read(),
File "/usr/lib64/python2.7/socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "/usr/lib64/python2.7/httplib.py", line 602, in read
s = self.fp.read(amt)
File "/usr/lib64/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
timeout: timed out

The alerts are raised by specific DataNodes, not by all of them.

What can be the problem here?

Thanks in advance

Panjl

kingpin · ‎04-21-2019

It seems the DataNode web server is not responding to the CM agent. This can happen for various reasons; from my experience, I would suggest checking the DataNode logs for JVM pauses or "Slow BlockReceiver" messages, as those are the primary reasons for slow responses from the DN to the CM agent. The DataNode logs you provided give some hint of a slow block receiver, but to be sure, filter the DataNode logs for the keywords mentioned above.
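A rough sketch of that log filter in Python, in case it is easier than grepping by hand. The two patterns are assumptions based on the wording Hadoop 2.x DataNodes typically use for JVM pause and slow BlockReceiver warnings, and `count_slow_indicators` is a made-up helper name; adjust both to whatever your logs actually contain.

```python
import re

# Assumed message fragments; verify them against your own DataNode logs.
PATTERNS = {
    "jvm_pause": re.compile(r"Detected pause in JVM or host machine"),
    "slow_block_receiver": re.compile(r"Slow BlockReceiver"),
}


def count_slow_indicators(lines):
    """Count how many log lines match each assumed slowness pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, pass an open log file, for example `count_slow_indicators(open(path, errors="replace"))`, where `path` points at the DataNode log on the affected host. Non-zero counts around the time of the alert would support the slow-DataNode theory.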

What is the frequency of this alert in CM? Is it a one-time alert or is it persistent?

You can check your DataNode heap utilization with a chart. In CM, open Chart Builder and run: select jvm_max_memory_mb, jvm_heap_used_mb where entityName="<DN-instance>"

PDDF_VIGNESH · ‎05-05-2022

It's a one-time alert; I'm getting it now.

paras · ‎05-12-2022

@PDDF_VIGNESH

See whether you can get a successful response, from the agent's host, at the URL reported in the agent logs:

http://datanode02.hadoop:1006/jmx

A few checks:

1. curl http://datanode02.hadoop:1006/jmx

2. telnet datanode02.hadoop 1006

If this is successful, restart the agent.

If there is an issue with the response, review the DN logs for the issues and workarounds suggested previously.
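If you would rather script the two checks, here is a minimal Python sketch. It assumes the /jmx endpoint returns a JSON object with a "beans" list, which is the usual Hadoop layout; `port_is_open`, `fetch_jmx`, and the `opener` parameter are names made up for this example, and the timeouts are arbitrary.

```python
import http.client
import json
import socket
import time
import urllib.request


def port_is_open(host, port, timeout=5.0):
    """TCP-level check, roughly what the telnet test verifies."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_jmx(url, timeout=10.0, opener=urllib.request.urlopen):
    """HTTP-level check, roughly what curl verifies.

    Returns (ok, elapsed_seconds, detail). `opener` exists only so the
    function can be exercised without a live DataNode.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
        beans = payload.get("beans", []) if isinstance(payload, dict) else []
        ok, detail = True, "%d beans" % len(beans)
    except (OSError, ValueError, http.client.HTTPException) as exc:
        # Covers connection failures, timeouts, and unparsable responses.
        ok, detail = False, repr(exc)
    return ok, time.monotonic() - start, detail
```

For example, run `port_is_open("datanode02.hadoop", 1006)` and `fetch_jmx("http://datanode02.hadoop:1006/jmx")` from the agent's host. An open port combined with a slow or failed fetch would point at the DataNode itself rather than the network.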

Hope this helps,
Paras
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

PDDF_VIGNESH · ‎05-16-2022

Hi,

I'm not able to get a response from the curl and telnet commands.

It says "failed to connect".

The issue frequency is low, but I'd like to know the root cause.

paras · ‎05-19-2022

@PDDF_VIGNESH

I hope you connected to the port configured for your cluster. Does the URL return a successful response in a browser? The responses need to be checked only at the time of the issue.

The JMX output is generated by the DataNode. Not getting a response means there are either issues with the DataNode or network issues with the communication.
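Since the problem is intermittent, one way to catch it at the time of the issue is to poll the endpoint and record when it fails. The sketch below is only an illustration: `poll`, `check_jmx`, the 30-second interval, and the 10-second timeout are all assumptions, not anything CM does itself.

```python
import time
import urllib.request
from datetime import datetime


def check_jmx(url="http://datanode02.hadoop:1006/jmx", timeout=10.0):
    """Return (ok, detail) for one fetch of the JMX URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return True, "ok"
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return False, repr(exc)


def poll(check, attempts, interval_s=30.0, sleep=time.sleep):
    """Call check() `attempts` times; return (timestamp, detail) per failure."""
    failures = []
    for i in range(attempts):
        ok, detail = check()
        if not ok:
            stamp = datetime.now().isoformat(timespec="seconds")
            failures.append((stamp, detail))
        if i < attempts - 1:
            sleep(interval_s)
    return failures
```

Running `poll(check_jmx, attempts=120)` from the agent's host gives about an hour of samples. The failure timestamps can then be lined up against the DataNode logs and the CM alert times.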

VidyaSargur · ‎05-25-2022

@PDDF_VIGNESH, did @paras's response help you resolve this issue?

Regards,

Vidya Sargur,
Community Manager


VidyaSargur · ‎05-16-2022

@PDDF_VIGNESH, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.

Regards,

Vidya Sargur,
Community Manager
