The cluster has 1 management node (Bright Cluster Manager and Ambari server), 2 NameNodes (1 active, 1 passive), and 17 DataNodes. It is running Hortonworks HDP 2.3.2 and Ambari 2.1.2.
Each node has two 10GbE NICs which are bonded together, and jumbo frames (MTU=9000) are enabled on the interfaces.
There are sporadic NodeManager Web UI alerts in Ambari. For all 17 DataNodes we get connection timeouts throughout the day. These timeouts are not correlated with any sort of load on the system; they happen no matter what.
When the connection to port 8042 is successful the response time is around 5-7 ms, but when the connection fails I get response times of 5 seconds. Never 3 seconds or 6 seconds, always 5 seconds. For example...
If I let the script run long enough then every DataNode will eventually turn up.
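A delay of exactly 5 seconds is worth a closer look because it matches the glibc resolver's default per-query timeout (RES_TIMEOUT, 5 seconds). One way to check is to time name resolution separately from the TCP connect. The following is only a rough sketch, not the script used above: the function names probe and classify and the 1-second threshold are made up for illustration, and the hostnames are whatever you pass on the command line.

```python
import socket
import sys
import time


def probe(host, port=8042, timeout=10.0):
    """Return (dns_seconds, connect_seconds) for one connection attempt."""
    t0 = time.monotonic()
    # AF_UNSPEC asks for both IPv4 and IPv6 addresses, which is the case
    # where glibc sends the A and AAAA queries together.
    infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    t1 = time.monotonic()
    family, socktype, proto, _canonname, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        s.connect(sockaddr)
    t2 = time.monotonic()
    return t1 - t0, t2 - t1


def classify(dns_seconds, threshold=1.0):
    """Label an attempt by how long name resolution took.

    The 1-second threshold is an arbitrary choice for this sketch.
    """
    return "slow-dns" if dns_seconds >= threshold else "ok"


if __name__ == "__main__":
    # Hostnames come from the command line, e.g. the 17 DataNodes.
    for host in sys.argv[1:]:
        dns_s, conn_s = probe(host)
        print(f"{host}: dns={dns_s * 1000:.1f} ms "
              f"connect={conn_s * 1000:.1f} ms {classify(dns_s)}")
```

If the slow attempts show the 5 seconds in the dns column and a normal connect time, the NodeManager itself is not the problem.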
It turns out that this is a DNS issue and the solution is to put
options single-request
in /etc/resolv.conf on all nodes.
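With this many nodes it is easy to miss one, so it can help to verify the setting after rolling it out. Below is a minimal sketch that checks the text of a resolv.conf file for the option. The helper names resolver_options and has_single_request are hypothetical, and the sketch assumes that multiple options lines are cumulative.

```python
def resolver_options(text):
    """Collect all tokens from 'options' lines of a resolv.conf text."""
    opts = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if fields[0] == "options":
            # Assumption: several 'options' lines add up.
            opts.extend(fields[1:])
    return opts


def has_single_request(text):
    """True if the 'single-request' option is set in the given text."""
    return "single-request" in resolver_options(text)


if __name__ == "__main__":
    with open("/etc/resolv.conf") as f:
        print("single-request set:", has_single_request(f.read()))
```

Note that the check is an exact match, so the separate single-request-reopen option does not count as single-request here.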
The resolv.conf man page describes this option as follows:
single-request (since glibc 2.10)
Sets RES_SNGLKUP in _res.options. By default, glibc performs IPv4 and IPv6 lookups in parallel since version 2.9. Some appliance DNS servers cannot handle these queries properly and make the requests time out. This option disables the behavior and makes glibc perform the IPv6 and IPv4 requests sequentially (at the cost of some slowdown of the resolving process).