Created on 09-08-2016 12:36 AM - edited 09-16-2022 03:38 AM
Hi,
I'm having an issue with multiple cloudera manager agents on a CDH 5.2 cluster.
The error we are seeing on the CM web interface is a generic one: "This host is in contact with Cloudera Manager. The host's Cloudera Manager Agent's software version cannot be determined."
The issue is not permanent and randomly comes and goes.
In the log file (/var/log/cloudera-scm-agent/cloudera-scm-agent.log) the daemon prints a lot of these messages:
[04/Sep/2016 17:40:54 +0000] 6757 MonitorDaemon-Reporter throttling_logger ERROR (9 skipped) Error sending messages to firehose: mgmt-HOSTMONITOR-398eb4f15a6b55c56ba3c74ad84d8633
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/firehose.py", line 75, in _send
    self._requestor.request('sendAgentMessages', dict(messages=messages))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 135, in request
    self.write_call_request(message_name, request_datum, buffer_encoder)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 173, in write_call_request
    self.write_request(message.request, request_datum, encoder)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 177, in write_request
    datum_writer.write(request_datum, encoder)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/io.py", line 768, in write
    raise AvroTypeException(self.writers_schema, datum)
AvroTypeException: ... formatted python dictionary ... is not an example of the schema [... whole avro schema ...]
From my understanding, the agent fails to serialize the metrics collected from the host into Avro and so can't update the Host Monitor.
On these machines I also have a known issue with the reported speed of the NICs: according to the OS, I have 10 Gbit/s interfaces that are doing hundreds of GiB/s (an obvious bug in the OS itself or in the NIC's firmware/driver).
Using the data from the "Last HMON status" page in the agent's web UI, I've discovered this "strange" coincidence: whenever an agent is experiencing the issue, there's at least one NIC with a metric value greater than the maximum signed 64-bit long:
{'iface': 'bond0', 'metrics': [ .... {'id': 11130, 'value': 9435474096333102400L}, ... ], }
I don't know exactly what these metrics measure, maybe the bytes sent/received in the last minute? Nevertheless these numbers, compared with other hosts without the NIC issue, are exceptionally high.
Now, in Python this isn't a problem because you basically can't overflow an int/long, but maybe the error above happens because the agent can't encode this very big number as a 64-bit long in Avro (9435474096333102400 is bigger than 9223372036854775807 = 2^63 - 1). I'm not sure about this because I can't really make sense of the Avro schema, and I don't know if the expected type for "value" is a long.
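To illustrate the hypothesis, here's a minimal sketch using the bond0 value quoted above (that the schema declares "value" as an Avro long is an assumption on my part):

```python
# Python ints are arbitrary precision and never overflow, but Avro's
# "long" type is a signed 64-bit integer with a fixed range.
AVRO_LONG_MIN = -(2 ** 63)
AVRO_LONG_MAX = 2 ** 63 - 1  # 9223372036854775807

metric_value = 9435474096333102400  # value reported for bond0 above

# Fine in Python: no overflow, just a bigger int
doubled = metric_value * 2

# But the value falls outside the Avro long range, so a writer that
# validates the datum against the schema would raise AvroTypeException.
fits = AVRO_LONG_MIN <= metric_value <= AVRO_LONG_MAX
print(fits)  # False
```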
What do you guys think? Has someone experienced anything like this?
And bonus question: Is it possible to blacklist the bugged network interfaces from the agent statistics?
Thanks,
p
Created 10-11-2016 12:00 AM
In the end we managed to solve this by excluding the problematic network interface from the agent monitoring.
Cloudera Manager indeed has an option to do that in the hosts configuration section. For the NICs it's called Network Interface Collection Exclusion Regex (by default only the loopback interface is excluded).
For the disks there are two options: Disk Device Collection Exclusion Regex and Filesystem Collection Exclusion Regex.
Maybe one of these does the trick for you...
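As a sketch of how such an exclusion regex behaves (the interface names, and the assumption that the default pattern matches only the loopback, are examples; check your own host configuration):

```python
import re

# Hypothetical exclusion pattern: the assumed default (loopback only)
# extended to also drop the bonded interface with the bogus counters.
exclusion = re.compile(r"^lo$|^bond0$")

for iface in ("lo", "bond0", "eth0"):
    excluded = bool(exclusion.match(iface))
    print(iface, "excluded" if excluded else "monitored")
# lo and bond0 are excluded; eth0 is still monitored
```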
Created 10-11-2016 08:19 AM
fantastic, thanks!
Created 10-11-2016 08:41 AM
ammolitor,
Cloudera has fixed this Cloudera Manager/Agent bug (Jira OPSAPS-35742) and the fix will be in upcoming releases of 5.5.x and up.
For now, the workaround is to remove the oversized device, as the agent code looks at the device regardless of the exclusions. You can still give the exclusions a try, though.
parnigot, this seems to be a new manifestation of the same problem we saw with large file system size. I'll open a new Jira for this as I don't think we have gotten a report of this at the interface level before. Great find on the workaround, too. Glad that works for the interface.
Regards,
Ben
Created 10-11-2016 08:59 AM
@ammolitor, the difference between your problem and the one @parnigot is seeing is that the large filesystem size is reported directly to Cloudera Manager via the agent's heartbeat. That cannot be excluded via configuration, so unmounting the file system would be the answer there until the fix is available in an upcoming release.
@parnigot, since your issue occurs (I just noted the full stack you provided) when the agent is reporting metrics to the Host Monitor, the metric collection for that interface can be excluded via the Network Interface Collection Exclusion Regex.
Even though the NIC's metrics seem to be misreported, I have opened an internal Cloudera Jira, OPSAPS-37261, so we can consider how to prevent this sort of thing from causing problems for the agent.
Thanks for the very detailed information!
Ben
Created 10-11-2016 10:04 AM
@bgooley is this coming soon?
This large filesystem is THE filesystem for my cluster, unmounting it is not an option in this case. Is there any other workaround?
Created 10-11-2016 10:25 AM
Sorry, no other workaround I can think of other than altering the code in "filesystem_map.py" (which I would not recommend).
The only version of Cloudera Manager that has the fix at this time is 5.7.4. If you are on a previous release, then you can upgrade CM and agents to get the fix.
Regards,
Ben
Created 10-11-2016 12:12 PM
Editing config.ini seemed to get us where we need to be.
Specifically we removed nfs and nfs4 from monitored_nodev_filesystem_types
sed -i 's/nfs,nfs4,//g' /etc/cloudera-scm-agent/config.ini
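A quick sketch of what that substitution does to the relevant line (the default value shown is an assumption; check your own config.ini before editing):

```python
# Assumed default line from the agent's config.ini
line = "monitored_nodev_filesystem_types=nfs,nfs4,tmpfs"

# Same effect as the sed expression: drop nfs and nfs4 from the list
print(line.replace("nfs,nfs4,", ""))
# -> monitored_nodev_filesystem_types=tmpfs
```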
Created 10-11-2016 12:22 PM
Awesome! I thought I had tested that, but apparently not. If your agent is heartbeating now, sounds like a good workaround till you can upgrade.
I checked, and CM 5.8.3 should also have the fix when it is released. It has not gone to code freeze yet, so we are still weeks out on that.
Thanks for sharing!