Scans of hbase:meta gradually get slower over time. Our main indicator is the "Get_99th_percentile" metric for the node that holds meta, exposed via JMX. In normal operation it stays below 10 ms, but after 24-48 hours it starts to climb until it reaches 50 ms. At that point our clients start re-establishing their connections and fail on pre-defined timeouts.
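For anyone who wants to watch the same metric, here is a minimal sketch of how it could be polled. It assumes the node's HTTP info server is reachable on the default port 16030 and that the metric is published under the bean `Hadoop:service=HBase,name=RegionServer,sub=Server`; both may differ by version and configuration. The function names and the 10 ms / 50 ms thresholds are only illustrative, taken from the numbers above.

```python
import json
import urllib.request

# Assumptions: bean name and metric location may vary between HBase versions.
JMX_QUERY = "Hadoop:service=HBase,name=RegionServer,sub=Server"
METRIC = "Get_99th_percentile"


def fetch_jmx(host, port=16030, query=JMX_QUERY, timeout=5):
    """Fetch the JSON document served by the /jmx servlet of one node."""
    url = f"http://{host}:{port}/jmx?qry={query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def extract_metric(payload, metric=METRIC):
    """Return the metric as a float, or None if no bean carries it."""
    for bean in payload.get("beans", []):
        if metric in bean:
            return float(bean[metric])
    return None


def classify(latency_ms, warn_ms=10.0, crit_ms=50.0):
    """Map a latency in ms to a coarse state using the thresholds above."""
    if latency_ms is None:
        return "unknown"
    if latency_ms >= crit_ms:
        return "critical"
    if latency_ms >= warn_ms:
        return "warning"
    return "ok"
```

Calling `classify(extract_metric(fetch_jmx("meta-host.example")))` on a schedule (the host name is a placeholder) would show when the node holding meta leaves its normal range, well before clients hit their timeouts.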
A workaround is to simply move meta to any other node. Although all nodes look the same in terms of load and hardware and have comparable uptime, this solves the problem for approximately the next 2 days. Then it starts to happen again.
GC logging (we are already using G1) has not revealed anything so far.