We've been experiencing an increase in CPU usage from one of the region servers on our cluster.
So far, we've investigated:
1. Memory issues
- The master and RS have enough memory and increasing it doesn't change anything.
- There is no increase in GC count or time.
- There is no increase in the number of requests.
2. Network issues
Zookeeper keeps logging errors due to sockets being closed by the client. We have tried both increasing and reducing the timeouts, and neither changes anything. We also have another cluster with the same specifications that logs the same timeout errors, and HBase behaves just fine there.
The increase in CPU usage is correlated with an increase in WAL append sizes.
We have a small 2-node cluster; the master and the problematic RS are both on node1.
We have Phoenix installed.
- Ambari: HDP-2.6.5
- HBase : 1.1.2
- HDFS : 2.7.3
- Zookeeper: 3.4.6
Any help would be much appreciated.
1) Check the region server logs for “responseTooSlow”, “operationTooSlow”, or any other WARN/ERROR messages, and please provide log snippets.
2) If you see "responseTooSlow" on the region servers, check the data node logs for the underlying issue.
3) In the data node logs, check whether any of the following ERROR/WARN messages appear:
Slow BlockReceiver write data to disk cost - This indicates that there was a delay in writing the block to the OS cache or disk.
Slow BlockReceiver write packet to mirror took - This indicates that there was a delay in writing the block across the network.
Slow flushOrSync took/Slow manageWriterOsCache took - This indicates that there was a delay in writing the block to the OS cache or disk.
4) If any of the above ERROR/WARN messages appear, work with your infrastructure team and OS vendor to fix the underlying hardware issue.
There are many reasons this could happen, including OS/kernel bugs (update your system), swap, transparent huge pages, and pauses by a hypervisor. You need to figure out which of these is causing the high CPU usage and fix it.
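To make the log checks above less manual, here is a minimal sketch in Python that counts how often each of the message fragments listed in the steps appears in a log. The function names (count_slow_messages, scan_log_file) are my own, not part of HBase or HDFS, and the matching is a plain substring check, so treat it as a starting point rather than a definitive tool.

```python
from collections import Counter

# Message fragments taken from the steps above
# (region server and data node logs).
PATTERNS = [
    "responseTooSlow",
    "operationTooSlow",
    "Slow BlockReceiver write data to disk cost",
    "Slow BlockReceiver write packet to mirror took",
    "Slow flushOrSync took",
    "Slow manageWriterOsCache took",
]


def count_slow_messages(lines):
    """Count how many lines contain each known 'slow' message fragment."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


def scan_log_file(path):
    """Scan one log file; the path is whatever your installation uses."""
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, errors="replace") as handle:
        return count_slow_messages(handle)
```

Run scan_log_file against the region server log and each data node log for the period when CPU was high. If a fragment shows a non-zero count on the data node side, that points toward the disk or network delay described in step 3.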
Thanks for using Cloudera Community. Based on the post, one Region Server is using high CPU. Please review the points raised by @PrathapKumar. Additionally, your team can perform the steps below:
(I) When the Region Server JVM reports high CPU, run "top" against the Region Server PID,
(II) Press "Shift H" to open the thread view of the PID. This shows the threads within the Region Server JVM with their CPU usage,
(III) Monitor the thread view and identify the thread with the highest CPU usage,
(IV) Take a thread dump (jstack) of the Region Server PID and match it against the thread consuming the most CPU in the "top" thread view. Note that "top" lists thread IDs in decimal, while jstack shows them as hexadecimal "nid" values.
The above process would allow you to identify the thread contributing to the CPU usage. Compare it with the other Region Server so your team can draw a conclusion about the cause of the CPU utilization. While the logs are being reviewed, narrowing the focus to the JVM would help in identifying the cause. Review the shared link for additional reference.
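The matching in step (IV) can be sketched in Python as below. The thread view in "top" shows thread IDs in decimal, while a jstack dump identifies each thread by a hexadecimal nid=0x... field, so the ID has to be converted before searching. The helper names (tid_to_nid, find_thread) are hypothetical, and the sketch assumes the dump was saved to a text file with thread entries separated by blank lines, as jstack normally prints them.

```python
import re


def tid_to_nid(tid):
    """Convert a decimal thread ID from top into jstack's hex nid form."""
    return hex(int(tid))


def find_thread(jstack_text, tid):
    """Return the dump entry whose nid matches the decimal tid, or None."""
    nid = tid_to_nid(tid)
    # Thread entries in a jstack dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", jstack_text):
        if re.search(r"\bnid=" + re.escape(nid) + r"\b", block):
            return block.strip()
    return None
```

For example, a thread shown as 12345 in "top" corresponds to nid=0x3039 in the dump. Taking a few dumps several seconds apart and looking up the same thread each time helps show whether it is consistently busy in the same code path.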
Kindly review & share your Observation in the Post.