Created 07-05-2017 06:48 PM
Region server logs:

WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 129793ms
No GCs detected

Since no GCs were detected, what could be the reason for the JVM pause, and how can I find out the cause?

Additional logs:

2017-07-02 05:17:19,754 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 143154ms for sessionid 0x15c6668db38d116, closing socket connection and attempting reconnect
2017-07-02 05:17:19,760 WARN [ResponseProcessor for block BP-2076318769-1.1.1.1-1496366779987:blk_1073755625_26457] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2076318769-1.1.1.1-1496366779987:blk_1073755625_26457
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2000)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
2017-07-02 05:17:19,756 INFO [regionserver/server.com/1.1.1.1:16201-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 143363ms for sessionid 0x15c6668db38d11b, closing socket connection and attempting reconnect
2017-07-02 05:17:19,769 INFO [server.com,16201,1498621362582_ChoreService_2] hbase.ScheduledChore: Chore: server.com,16201,1498621362582-MemstoreFlusherChore missed its start time
2017-07-02 05:17:19,769 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 129793ms
No GCs detected
2017-07-02 05:17:19,770 INFO [server.com,16201,1498621362582_ChoreService_3] hbase.ScheduledChore: Chore: server.com,16201,1498621362582-HeapMemoryTunerChore missed its start time
2017-07-02 05:17:19,769 INFO [server.com,16201,1498621362582_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2017-07-02 05:17:19,760 WARN [ResponseProcessor for block BP-2076318769-1.1.1.1-1496366779987:blk_1073755624_26456] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2076318769-1.1.1.1-1496366779987:blk_1073755624_26456
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2000)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
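For context, the JvmPauseMonitor works by sleeping for a fixed interval and comparing how much wall-clock time actually elapsed; "No GCs detected" means the JVM's GC counters did not advance across the gap, which points away from garbage collection and toward the host itself (swapping, an overloaded hypervisor, or the kernel descheduling the process). A minimal Python sketch of the same detection idea, where the 1000ms threshold is an illustrative value and not HBase's actual configuration:

```python
import time

PAUSE_THRESHOLD_MS = 1000  # illustrative threshold, not HBase's default

def apparent_pause_ms(requested_sleep_ms, observed_elapsed_ms):
    """Time beyond the requested sleep is an apparent JVM/host pause."""
    return max(0, observed_elapsed_ms - requested_sleep_ms)

def monitor(interval_ms=500, iterations=4):
    """Sleep in a loop and report iterations where the process was
    descheduled for much longer than the requested interval."""
    pauses = []
    for _ in range(iterations):
        start = time.monotonic()
        time.sleep(interval_ms / 1000.0)
        elapsed_ms = (time.monotonic() - start) * 1000.0
        pause = apparent_pause_ms(interval_ms, elapsed_ms)
        if pause > PAUSE_THRESHOLD_MS:
            pauses.append(round(pause))
    return pauses
```

The 129793ms warning corresponds to roughly this arithmetic: a sleep of a few hundred milliseconds that did not return for about 130 seconds.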
Created 07-06-2017 04:08 PM
129793ms is a very long time indeed. I would suspect that the kernel is blocking the process and not scheduling it for some reason. I would look into kernel messages (e.g. dmesg, /var/log/messages) for clues.
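Building on the suggestion above, one way to triage is to scan the dmesg / /var/log/messages output around the pause timestamp for the usual culprits. The patterns below are illustrative examples of messages the Linux kernel emits for OOM kills, the hung-task watchdog, and memory pressure, not an exhaustive list:

```python
# Illustrative kernel-log patterns that commonly explain long process stalls.
KERNEL_CLUES = (
    "Out of memory",            # OOM killer fired
    "oom-killer",               # OOM killer invocation
    "blocked for more than",    # hung-task watchdog (e.g. stuck on I/O)
    "page allocation failure",  # severe memory pressure
)

def find_kernel_clues(log_text):
    """Return the lines of a dmesg/syslog dump that match a known clue."""
    return [line for line in log_text.splitlines()
            if any(clue in line for clue in KERNEL_CLUES)]
```

It is also worth checking sar/vmstat history for swap activity (the si/so columns) at the same time, since heavy swapping of the JVM heap produces exactly this kind of multi-minute stall without any GC.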
Created 07-06-2017 10:26 PM
After restarting, the region server went down again. This time it is not because of a JVM pause. Is it again the kernel, blocking on a flush of WALs (after random delay 238839ms)?
2017-07-05 16:57:11,117 INFO [server.com,16201,1499221788437_ChoreService_1] regionserver.HRegionServer: server.com,16201,1499221788437-MemstoreFlusherChore requesting flush of __test_table_server.com__,,1496366805684.7d0dbb83f455b1bda1a0b9efa4e9719c. because __status_family___ has an old edit so flush to free WALs after random delay 238839ms
2017-07-05 16:58:24,389 INFO [regionserver/server.com/1.1.1.1:16201-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 80029ms for sessionid 0x15c6668db396db5, closing socket connection and attempting reconnect
2017-07-05 16:58:24,390 INFO [server.com,16201,1499221788437_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2017-07-05 16:58:24,390 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 80008ms for sessionid 0x15c6668db396db4, closing socket connection and attempting reconnect
2017-07-05 16:58:24,412 INFO [server.com,16201,1499221788437_ChoreService_1] hbase.ScheduledChore: Chore: server.com,16201,1499221788437-MemstoreFlusherChore missed its start time
2017-07-05 16:58:24,415 INFO [server.com,16201,1499221788437_ChoreService_4] regionserver.HRegionServer: server.com,16201,1499221788437-MemstoreFlusherChore requesting flush of __test_table_server.com__,,1496366805684.7d0dbb83f455b1bda1a0b9efa4e9719c. because __status_family___ has an old edit so flush to free WALs after random delay 114843ms
2017-07-05 16:58:24,415 INFO [server.com,16201,1499221788437_ChoreService_3] hbase.ScheduledChore: Chore: server.com,16201,1499221788437-HeapMemoryTunerChore missed its start time
2017-07-05 16:58:26,224 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2017-07-05 16:58:26,225 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
2017-07-05 16:58:26,227 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x15c6668db396db4 has expired, closing socket connection
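Note what this second log actually shows: the ZooKeeper client had heard nothing from the server for ~80 seconds, longer than its session timeout, so the session expired (the 16:58:26,227 line) rather than reconnecting, and HBase treats an expired session as fatal for the region server. The relationship can be sketched as below; the 40000ms figure in the usage is only an example timeout, the real negotiated value appears in your own connection logs:

```python
def session_outcome(silence_ms, negotiated_timeout_ms):
    """A stall longer than the negotiated ZooKeeper session timeout
    expires the session; HBase aborts a region server whose session
    has expired, while a shorter stall only triggers a reconnect."""
    if silence_ms >= negotiated_timeout_ms:
        return "expired -> region server aborts"
    return "reconnect within same session"
```

So whether the root cause is the kernel, swap, or I/O, any stall that outlasts the session timeout will take the region server down; raising the timeout only buys headroom, it does not remove the underlying stall.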
Created 08-27-2018 03:42 PM
Were you able to find the cause of the issue?
Thanks
Manish