Version: Cloudera Express 5.14.2
1 master node
"The health test result for DATA_NODE_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server."
When the above alert pops up, the following records are seen in the DataNode logs:
ERROR DataNode BlockSender.sendChunks() exception: java.io.IOException: Broken pipe
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:605)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:789)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:736)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:551)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:748)
2158 Monitor-GenericMonitor throttling_logger ERROR (8 skipped) Error fetching metrics at 'http://datanode02.hadoop:1006/jmx'
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.14.2-py2.7.egg/cmf/monitor/generic/metric_collectors.py", line 203, in _collect_and_parse_and_return
    simplejson.load(opened_url))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/simplejson-2.1.2-py2.7-linux-x86_64.egg/simplejson/__init__.py", line 324, in load
    return loads(fp.read(),
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 602, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
timeout: timed out
The alerts are raised only by specific DataNodes, not by all of them.
What can be the problem here?
Thanks in advance
It seems the DataNode web server is not responding to the CM agent. This can happen for various reasons; from my experience, I would suggest checking the DataNode logs for "JVM Pause" or "Slow BlockReceiver" messages, as those are the primary causes of a slow response from the DataNode to the CM agent. The DataNode logs you provided already hint at a slow block receiver, but just to be sure, filter the logs for the keywords mentioned above.
What is the frequency of this alert in CM? Is it a one-time alert or persistent?
You can check your DataNode heap utilization with a chart. Here's how: CM -> Chart Builder -> select jvm_max_memory_mb, jvm_heap_used_mb where entityName = "<DN-instance>"
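If you prefer the command line over Chart Builder, the same heap numbers are exposed by the DataNode's /jmx endpoint (the one the CM agent polls, per the traceback above). A minimal sketch, assuming the standard java.lang:type=Memory MBean shape; in practice you would fetch http://datanode02.hadoop:1006/jmx instead of the embedded sample payload:

```python
import json

# Sample /jmx payload shaped like the JVM's java.lang:type=Memory MBean
# (in practice, fetch it from the DataNode web port shown in the logs).
sample = '''
{"beans": [{"name": "java.lang:type=Memory",
            "HeapMemoryUsage": {"used": 734003200, "max": 1073741824}}]}
'''

def heap_utilization(jmx_json):
    """Return (used_mb, max_mb, percent_used) from a /jmx Memory-bean payload."""
    beans = json.loads(jmx_json)["beans"]
    mem = next(b for b in beans if b["name"] == "java.lang:type=Memory")
    used = mem["HeapMemoryUsage"]["used"]
    maximum = mem["HeapMemoryUsage"]["max"]
    return used // 2**20, maximum // 2**20, round(100.0 * used / maximum, 1)

print(heap_utilization(sample))  # → (700, 1024, 68.4)
```

If the heap is consistently near the maximum, that lines up with the JVM-pause theory above.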
See if you can get a successful response from the agent host to the DataNode reported in the logs below:
1. curl http://datanode02.hadoop:1006/jmx
2. telnet datanode02.hadoop 1006
If this is successful, restart the agent.
If there is an issue with the response, review the DN logs for the issues/workarounds suggested previously.
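The checks above can also be scripted so they run automatically during the alert window. A minimal sketch, with socket.create_connection playing the role of telnet (the hypothetical host/port below are taken from the logs in this thread):

```python
import socket

def port_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection covers DNS resolution, connect, and timeout in one call
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts
        return False

# e.g. the DataNode web port from the alert:
# print(port_open("datanode02.hadoop", 1006))
```

A False result at the time of the alert points at the DataNode web server or the network, not at the CM agent.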
Hope this helps,
I am not able to get a response from the curl and telnet commands; they say "failed to connect".
The issue frequency is low, but I would like to know its root cause.
I hope you connected to the port configured for your cluster. Does the URL return a successful response in a browser? Note that the responses need to be checked at the time the issue occurs.
The JMX output is generated by the DataNode. Not getting a response means there is either an issue with the DataNode itself or a network issue with the communication.
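To reproduce the agent's "timed out" failure mode yourself, you can fetch /jmx with an explicit timeout, similar in spirit to what the monitor code in the traceback does (a Python 3 sketch; the agent itself uses Python 2 with simplejson):

```python
import json
import socket
import urllib.error
import urllib.request

def fetch_jmx(url, timeout=10):
    """Fetch and parse a /jmx endpoint; return None on timeout or connect error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, socket.timeout, TimeoutError):
        return None

# e.g. fetch_jmx("http://datanode02.hadoop:1006/jmx")
```

Running this in a loop during the alert window shows whether the DataNode stops answering entirely (connect failure) or merely answers too slowly for the agent's timeout.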
@PDDF_VIGNESH, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.