We've been seeing an increase in CPU usage on one of the region servers (RS) in our cluster.
So far, we've investigated:
1. Memory issues
- The master and RS have enough memory and increasing it doesn't change anything.
- There is no increase in GC count or time.
- There is no increase in the number of requests.
2. Network issues
ZooKeeper keeps logging errors about sockets being closed by the client. We've tried both increasing and reducing the timeouts, and neither changes anything. We also have another cluster with the same specifications that logs timeout errors as well, and HBase behaves just fine there.
The increase in CPU usage is correlated with an increase in WAL append sizes.
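For anyone wanting to put a number on that correlation, here is a minimal sketch that computes the Pearson coefficient between two time series sampled at the same timestamps. The function name `pearson` and the sample values are hypothetical, made up for illustration; they are not measurements from our cluster, and the series would need to be exported from whatever monitoring is in use.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps (illustrative only).
cpu_percent = [20.0, 25.0, 40.0, 55.0, 70.0]
wal_append_bytes = [1000.0, 1300.0, 2100.0, 3000.0, 3800.0]

r = pearson(cpu_percent, wal_append_bytes)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 only says the two series move together; it does not say which one drives the other.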
We have a small cluster of 2 nodes; the master and the problematic RS are both on node1.