Support Questions

HBase region server went down because of a JVM pause; what could be the reason for the JVM pause?

Explorer

Region server logs

WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 129793ms No GCs detected

Since it's clear that no GCs were detected, what could be the reason for the JVM pause, and how can I find out the cause?

Additional logs:

2017-07-02 05:17:19,754 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 143154ms for sessionid 0x15c6668db38d116, closing socket connection and attempting reconnect
2017-07-02 05:17:19,754 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 143154ms for sessionid 0x15c6668db38d116, closing socket connection and attempting reconnect
2017-07-02 05:17:19,760 WARN [ResponseProcessor for block BP-2076318769-1.1.1.1-1496366779987:blk_1073755625_26457] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2076318769-1.1.1.1-1496366779987:blk_1073755625_26457
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2000)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
2017-07-02 05:17:19,756 INFO [regionserver/server.com/1.1.1.1:16201-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 143363ms for sessionid 0x15c6668db38d11b, closing socket connection and attempting reconnect
2017-07-02 05:17:19,769 INFO [server.com,16201,1498621362582_ChoreService_2] hbase.ScheduledChore: Chore: server.com,16201,1498621362582-MemstoreFlusherChore missed its start time
2017-07-02 05:17:19,769 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 129793ms No GCs detected
2017-07-02 05:17:19,770 INFO [server.com,16201,1498621362582_ChoreService_3] hbase.ScheduledChore: Chore: server.com,16201,1498621362582-HeapMemoryTunerChore missed its start time
2017-07-02 05:17:19,769 INFO [server.com,16201,1498621362582_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2017-07-02 05:17:19,760 WARN [ResponseProcessor for block BP-2076318769-1.1.1.1-1496366779987:blk_1073755624_26456] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2076318769-1.1.1.1-1496366779987:blk_1073755624_26456
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2000)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
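For context on the "No GCs detected" part: JvmPauseMonitor sleeps for a short fixed interval, measures how much longer the sleep actually took, and checks the JVM's GC counters over that window. If the counters did not move, the stall happened outside the collector, i.e. at the host level (swapping, blocked I/O, a frozen VM guest). A simplified sketch of that sleep-and-measure technique follows; it is not the actual HBase source, and the names and thresholds are illustrative only.

// Simplified illustration of a JVM pause monitor: sleep for a fixed interval,
// measure how long the sleep really took, and warn when the extra delay is large.
public class PauseMonitorSketch {
    private static final long SLEEP_MS = 500;              // expected sleep interval
    private static final long WARN_THRESHOLD_MS = 10_000;  // warn if this much time was "lost"

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long start = System.nanoTime();
            Thread.sleep(SLEEP_MS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            long pauseMs = elapsedMs - SLEEP_MS;
            if (pauseMs > WARN_THRESHOLD_MS) {
                // The real monitor also compares GC bean counters taken before and
                // after the sleep; if they did not change, it logs "No GCs detected",
                // meaning the stall most likely came from the OS/host, not from GC.
                System.err.println("Detected pause of approximately " + pauseMs + "ms");
            }
        }
    }
}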

3 REPLIES

Guru

129793ms is a very long time indeed. I would suspect that the kernel is blocking the process and not scheduling it for some reason. I would look into the kernel messages (e.g. dmesg, /var/log/messages) for clues.
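If it helps, here is a minimal sketch that scans the system log for kernel events that commonly stall an entire process (OOM killer, hung tasks, soft lockups). The log path and keyword list are assumptions; on Debian/Ubuntu the file is /var/log/syslog, and you can equally grep the output of dmesg -T.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

public class KernelLogScan {
    public static void main(String[] args) throws IOException {
        // Kernel messages that typically explain a long, whole-process stall.
        List<String> keywords = List.of(
                "Out of memory", "oom-killer", "blocked for more than",
                "soft lockup", "hung_task", "page allocation");
        // Path is an assumption (RHEL/CentOS-style); adjust for your distribution.
        try (Stream<String> lines = Files.lines(Paths.get("/var/log/messages"))) {
            lines.filter(l -> keywords.stream().anyMatch(l::contains))
                 .forEach(System.out::println);
        }
    }
}

Checking swap activity around the time of the pause (vmstat, sar) would also help narrow down whether the host was thrashing.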

Explorer

After restarting, the region server went down again. This time it's not because of a JVM pause. Is it again because of the kernel blocking on a flush of WALs (after random delay 238839ms)?

2017-07-05 16:57:11,117 INFO  [server.com,16201,1499221788437_ChoreService_1] regionserver.HRegionServer: server.com,16201,1499221788437-MemstoreFlusherChore requesting flush of __test_table_server.com__,,1496366805684.7d0dbb83f455b1bda1a0b9efa4e9719c. because __status_family___ has an old edit so flush to free WALs after random delay 238839ms
2017-07-05 16:58:24,389 INFO  [regionserver/server.com/1.1.1.1:16201-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 80029ms for sessionid 0x15c6668db396db5, closing socket connection and attempting reconnect
2017-07-05 16:58:24,390 INFO  [server.com,16201,1499221788437_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2017-07-05 16:58:24,390 INFO  [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 80008ms for sessionid 0x15c6668db396db4, closing socket connection and attempting reconnect
2017-07-05 16:58:24,412 INFO  [server.com,16201,1499221788437_ChoreService_1] hbase.ScheduledChore: Chore: server.com,16201,1499221788437-MemstoreFlusherChore missed its start time
2017-07-05 16:58:24,415 INFO  [server.com,16201,1499221788437_ChoreService_4] regionserver.HRegionServer: server.com,16201,1499221788437-MemstoreFlusherChore requesting flush of __test_table_server.com__,,1496366805684.7d0dbb83f455b1bda1a0b9efa4e9719c. because __status_family___ has an old edit so flush to free WALs after random delay 114843ms
2017-07-05 16:58:24,415 INFO  [server.com,16201,1499221788437_ChoreService_3] hbase.ScheduledChore: Chore: server.com,16201,1499221788437-HeapMemoryTunerChore missed its start time
2017-07-05 16:58:26,224 INFO  [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2017-07-05 16:58:26,225 INFO  [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
2017-07-05 16:58:26,227 INFO  [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x15c6668db396db4 has expired, closing socket connection


Were you able to find the cause of the issue?

Thanks

Manish