Every once in a while I receive a "Stale" alert from the DataNode Health Summary alert.
It appears that some DataNodes occasionally go more than 30 seconds without a heartbeat reaching the NameNode (NN), as seen in the "Last Contact" column of the DataNode Information page in the NN UI, and this results in a "stale" alert.
What can cause these spikes?
Thanks in advance!
Please check if all your nodes are in the same network segment.
This kind of intermittent problem is usually due to network issues. Check the MTU on each node.
How to check and set up the MTU for a network interface:
The MTU (Maximum Transmission Unit) is the largest packet size, in bytes, that a network interface will send in a single frame.
Check the current MTU setting:
$ ip link list
The default for Ethernet is usually 1500 bytes.
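If the `ip link list` output is hard to scan across many interfaces, a minimal sketch like the one below prints one "interface MTU" pair per line. It assumes a Linux host that exposes `/sys/class/net/<interface>/mtu`; the function name `list_mtus` and its optional directory argument are made up for this illustration.

```shell
#!/bin/sh
# Print "<interface> <mtu>" for every interface found under a sysfs-style
# directory. The optional first argument is a hypothetical override for the
# directory to scan; it defaults to /sys/class/net on Linux.
list_mtus() {
    net_dir="${1:-/sys/class/net}"
    for dev in "$net_dir"/*; do
        # Skip entries that do not expose a readable mtu file.
        [ -r "$dev/mtu" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
    done
}

list_mtus
```

Running this on each DataNode and on the NameNode makes it easier to spot a node whose MTU differs from the rest.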
To make the setting permanent for eth0, edit the configuration file /etc/sysconfig/network-scripts/ifcfg-ethX (Red Hat Linux), where X is the interface number, so ifcfg-eth0 here:
DEVICE=eth0
BOOTPROTO=static
BROADCAST=192.168.1.255
HWADDR=00:0F:EA:91:04:07
IPADDR=192.168.1.111
NETMASK=255.255.255.0
NETWORK=192.168.1.0
MTU=1400
ONBOOT=yes
TYPE=Ethernet
Save the file and restart the network service. If you are using Red Hat:
# service network restart
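One way to confirm that packets of the configured size actually get through between nodes is to ping with fragmentation prohibited. The sketch below is only an illustration: it assumes IPv4 and the Linux iputils `ping` (whose `-M do` option sets the Don't Fragment bit), and the function names and the hostname are placeholders. The payload is the MTU minus 28 bytes (a 20-byte IPv4 header without options plus an 8-byte ICMP header).

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one packet of the given MTU:
# MTU - 20 bytes (IPv4 header, no options) - 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send three pings sized to the MTU with fragmentation prohibited.
# Arguments: target host, then the MTU to test (defaults to 1500).
probe_mtu() {
    host="$1"
    mtu="${2:-1500}"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}

# Example (namenode.example.com is a placeholder hostname):
#   probe_mtu namenode.example.com 1400
```

If the pings fail at the configured MTU but succeed at a smaller size, something on the path between the two nodes is probably using a smaller MTU.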