I have a cluster with 8 worker nodes (DN, NM and RS). The dev team is running a MapReduce program through an Oozie workflow. The step in question is an MR job that loads data into HBase tables. While it runs, one of two things happens:
1. Heavy load on had01 causes the Region Server to shut down. The other Region Servers work fine; the issue only seems to affect this one. I see a lot of JVM pauses in the log (non-GC), and it loses its connection to the ZooKeeper servers before shutting down.
2. When the RS doesn't shut down, I still see heavy load on this node (load averages 121.2, 97.3, 87.3), and the map tasks that run on this node take much longer than on the other nodes.
Other nodes -> less than 2 mins
had01 -> 7+ mins
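To put those load averages in context, here is a minimal sketch that divides them by the number of cores. The core count of 32 in the usage note is purely an assumption for illustration (the real figure for had01 is not stated here), and `load_per_core` and `local_load_per_core` are names made up for this example.

```python
import os


def load_per_core(load_avgs, cores):
    """Divide each load average by the core count.

    Values well above 1.0 mean tasks are queuing. On Linux the load
    average also counts tasks blocked in uninterruptible I/O wait.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return tuple(avg / cores for avg in load_avgs)


def local_load_per_core():
    """Same calculation for the machine this runs on (Unix only)."""
    cores = os.cpu_count() or 1
    return load_per_core(os.getloadavg(), cores)
```

For example, `load_per_core((121.2, 97.3, 87.3), 32)` gives roughly 3.8, 3.0 and 2.7 per core under that assumed core count, so the node would be heavily oversubscribed for the whole window.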
Other things I have noticed on had01:
1. Heavy I/O (700 MB/s - 2 GB/s)
2. The number of blocks on this node is twice that of each of the other 7 nodes.
3. HBase Web UI shows the Write Request Count for this node as 0
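The block count comparison can be made repeatable with a small sketch like the one below. The host names other than had01 and all of the counts are hypothetical placeholders, not real figures from this cluster, and `flag_skewed_nodes` is a helper invented for this example.

```python
from statistics import median


def flag_skewed_nodes(block_counts, threshold=1.5):
    """Return {node: ratio} for nodes holding more than `threshold`
    times the median block count across the cluster."""
    mid = median(block_counts.values())
    if mid <= 0:
        return {}
    return {
        node: count / mid
        for node, count in block_counts.items()
        if count > threshold * mid
    }


# Hypothetical numbers: had01 holds twice as many blocks as the rest.
sample_counts = {"had01": 200_000}
sample_counts.update({f"had0{i}": 100_000 for i in range(2, 9)})
print(flag_skewed_nodes(sample_counts))
```

The median is used rather than the mean so that the one overloaded node does not pull the baseline up and hide itself.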
Can someone point me to where I should troubleshoot further? The problem only seems to happen while this step of the workflow is running. The node comes back up and is stable afterwards.