
Agents unable to contact Host Monitor due to Avro schema errors

Explorer

Hi,

I'm having an issue with multiple Cloudera Manager agents on a CDH 5.2 cluster.
The error we are seeing on the CM web interface is a generic one: "This host is in contact with Cloudera Manager. The host's Cloudera Manager Agent's software version can not be determined."

The issue is intermittent: it randomly comes and goes.

In the log files (/var/log/cloudera-scm-agent/cloudera-scm-agent.log) the daemon prints a lot of these messages:


[04/Sep/2016 17:40:54 +0000] 6757 MonitorDaemon-Reporter throttling_logger ERROR    (9 skipped) Error sending messages to firehose: mgmt-HOSTMONITOR-398eb4f15a6b55c56ba3c74ad84d8633
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/firehose.py", line 75, in _send
    self._requestor.request('sendAgentMessages', dict(messages=messages))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 135, in request
    self.write_call_request(message_name, request_datum, buffer_encoder)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 173, in write_call_request
    self.write_request(message.request, request_datum, encoder)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 177, in write_request
    datum_writer.write(request_datum, encoder)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/io.py", line 768, in write
    raise AvroTypeException(self.writers_schema, datum)

    ... formatted python dictionary ...

    is not an example of the schema [... whole avro schema...]

From my understanding, the agent fails to serialize the data collected from the host into Avro and therefore can't update the Host Monitor.

On these machines I also have a known issue with the reported speed of the NICs: according to the OS, I have 10 Gbit/s interfaces that are doing hundreds of GiB/s (an obvious bug in the OS itself or in the NIC's firmware/driver).

Using the data from the "Last HMON status" page of the agent's web UI, I've discovered this "strange" coincidence: when an agent is experiencing the issue, there's at least one NIC with a metric value greater than the max long value:


{'iface': 'bond0',
 'metrics': [ ....
             {'id': 11130,
              'value': 9435474096333102400L},
              ... ],
}

I don't know exactly what these metrics measure, maybe the bytes sent/received in the last minute? Nevertheless these numbers, compared with other hosts without the NIC issue, are exceptionally high.

Now, in Python this isn't a problem because you basically can't overflow an int/long, but maybe the error above happens when the agent can't fit this very big number into a 64-bit long in Avro (9435474096333102400 is bigger than 9223372036854775807 = 2^63 - 1). I'm not sure about this because I can't really understand the Avro schema, and I don't know if the expected type for value is a long.
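
A quick way to sanity-check this theory with the avro library itself (a minimal sketch; the "Metric" record below is made up for illustration, since I can't see the real Host Monitor schema, and I'm assuming the value field is declared as an Avro long):

import io
import json
import avro.io
import avro.schema

# Hypothetical stand-in for the Host Monitor metric schema: a single
# field declared as an Avro "long" (this is an assumption).
SCHEMA = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "Metric",
    "fields": [{"name": "value", "type": "long"}],
}))

writer = avro.io.DatumWriter(SCHEMA)
encoder = avro.io.BinaryEncoder(io.BytesIO())

# 2**63 - 1 is the largest value an Avro long can hold: this write succeeds.
writer.write({"value": 9223372036854775807}, encoder)

# The value reported for bond0 exceeds 2**63 - 1: validation fails and
# avro raises AvroTypeException, just like in the traceback above.
writer.write({"value": 9435474096333102400}, encoder)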

What do you guys think? Has anyone experienced anything like this?


And a bonus question: is it possible to blacklist the buggy network interfaces from the agent statistics?


Thanks,
p


9 REPLIES

Explorer

Seeing a similar agent/Python error with a large filesystem mounted. Any workarounds found yet?

Explorer

In the end we managed to solve this by excluding the problematic network interface from the agent monitoring.

Cloudera Manager indeed has an option to do that in the hosts configuration section. For the NICs it's called Network Interface Collection Exclusion Regex (by default only the loopback interface is excluded).
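
For example, to also exclude the misbehaving bond0 interface while keeping the default loopback exclusion, the value would be something like this (the exact default may differ between CM versions):

^lo$|^bond0$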


@ammolitor

For the disks there are two options: Disk Device Collection Exclusion Regex and Filesystem Collection Exclusion Regex.
Maybe one of these does the trick for you...


Explorer

Fantastic, thanks!

Master Guru

@ammolitor,

Cloudera has fixed this Cloudera Manager/Agent bug (Jira OPSAPS-35742), and the fix will be in upcoming releases of 5.5.x and up.

For now, the workaround is to remove the large device, as the agent code will look at the device regardless of the exclusions. You can still give the exclusion a try, though.

@parnigot, this seems to be a new manifestation of the same problem we saw with large filesystem sizes. I'll open a new Jira for this, as I don't think we have gotten a report of this at the interface level before. Great find on the workaround, too. Glad that works for the interface.

Regards,

Ben

Master Guru

@ammolitor, the difference between your issue and the one @parnigot is seeing is that the large filesystem size is reported directly to Cloudera Manager via the agent's heartbeat. That cannot be excluded via configuration, so unmounting the filesystem would be the answer there until the fix is available in an upcoming release.

@parnigot, since your issue occurs when the agent reports metrics to the Host Monitor (I just noted the full stack trace you provided), the metric collection for that interface can be excluded via Network Interface Collection Exclusion Regex.


Even though the NIC's metrics seem to be misreported, I have opened an internal Cloudera Jira, OPSAPS-37261, so we can consider how to prevent this sort of thing from causing problems for the agent.

Thanks for the very detailed information!

Ben

Explorer

@bgooley Is this coming soon?

This large filesystem is THE filesystem for my cluster; unmounting it is not an option in this case. Is there any other workaround?

Master Guru

Sorry, no other workaround I can think of other than altering the code in "filesystem_map.py" (which I would not recommend).


The only version of Cloudera Manager that has the fix at this time is 5.7.4.  If you are on a previous release, then you can upgrade CM and agents to get the fix.

Regards,

Ben

Explorer

Editing config.ini seemed to get us where we need to be.

Specifically, we removed nfs and nfs4 from monitored_nodev_filesystem_types:

sed -i 's/nfs,nfs4,//g' /etc/cloudera-scm-agent/config.ini
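
For reference, the relevant line in config.ini goes from something like this (the exact default list may vary by agent version):

monitored_nodev_filesystem_types=nfs,nfs4,tmpfs

to this:

monitored_nodev_filesystem_types=tmpfs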

Master Guru

Awesome! I thought I had tested that, but apparently not. If your agent is heartbeating now, it sounds like a good workaround till you can upgrade.

I checked, and CM 5.8.3 should also have the fix when it is released. It has not gone to code freeze yet, so we are still weeks out on that.


Thanks for sharing!