Throughout the day, NAME_NODE_RPC_LATENCY alerts trigger on our cluster and then clear shortly afterwards.
Here is one from this morning:
NAME_NODE_RPC_LATENCY has become bad: The moving average of the RPC latency is 11.7 second(s) over the previous 5 minute(s). The moving average of the queue time is 5.3 second(s). The moving average of the processing time is 6.4 second(s). Critical threshold: 5 second(s).
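To make the breakdown in that alert easier to read, here is a minimal Python sketch that pulls the numbers out of the message text and shows how much of the latency is queue time versus processing time. The function name `parse_alert` and the regular expressions are my own and assume the alert wording stays exactly as quoted above; this is not part of any Hadoop or monitoring tool API.

```python
import re

# Alert text copied verbatim from the monitoring message above.
ALERT = (
    "NAME_NODE_RPC_LATENCY has become bad: The moving average of the RPC "
    "latency is 11.7 second(s) over the previous 5 minute(s). The moving "
    "average of the queue time is 5.3 second(s). The moving average of the "
    "processing time is 6.4 second(s). Critical threshold: 5 second(s)."
)

# Hypothetical patterns, written against the wording of this one alert.
PATTERNS = {
    "total": r"RPC latency is ([\d.]+) second",
    "queue": r"queue time is ([\d.]+) second",
    "processing": r"processing time is ([\d.]+) second",
    "threshold": r"Critical threshold: ([\d.]+) second",
}


def parse_alert(text):
    """Return the latency figures (in seconds) found in the alert text."""
    values = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find {name!r} in alert text")
        values[name] = float(match.group(1))
    return values


if __name__ == "__main__":
    v = parse_alert(ALERT)
    queue_share = v["queue"] / v["total"]
    print(f"total={v['total']}s queue={v['queue']}s processing={v['processing']}s")
    print(f"queue share of total: {queue_share:.0%}")
    print(f"over threshold by: {v['total'] - v['threshold']:.1f}s")
```

In this alert the queue time and the processing time add up to the reported RPC latency, and each of them is above the 5 second critical threshold on its own, so both the waiting and the handling of requests appear to be slow during these spikes.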
I have not been able to find anything in my books or online about how to mitigate this, other than perhaps adding more NameNodes using HDFS federation. When I look at the server running the NameNode, I do not see heavy CPU utilization or any memory pressure. The only thing that stands out in this time frame is a spike in network traffic.
Is there anything I can do to mitigate this problem? We do not have jumbo frames enabled, if that makes a difference.