We recently had a failure of all of the region servers in our cluster, although the active master and standby master stayed up. When the region servers were brought back up, regions were assigned relatively quickly. However, two regions have not come up, and according to the UI they are stuck in the OFFLINE state.
I have tried running hbase hbck -repair a number of times, as well as various other options that I hoped would help (-fixAssignments, -fixSplitParents). None of these successfully brought the regions online. I checked the region server logs for these regions and found no reference to them after they were closed, prior to the region server failure.
When I checked the master logs however I found the following:
master.AssignmentManager: Skip assigning table_name,13153,1485460927890.3d68e485cb6294345fe1469097fa5aca., it is on a dead but not processed yet server: server05,16020,1494493877392
The server listed as a dead server is alive and well, with over 200 regions already assigned to it. This error message led me to HBASE-13605, HBASE-13330 and HBASE-12440 which all describe pretty much the same issue. Unfortunately none of these JIRAs describe any way to fix the issue once it occurs.
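For anyone trying to read that log line: a server name in HBase has the form host,port,startcode, where the start code is the time the region server process started, in milliseconds since the epoch. The entry the master considers dead may therefore be a different instance from the one currently running on the same host, and comparing start codes is one way to check. Below is a minimal Python sketch that pulls the pieces out of the message. The function name and regular expression are my own and only assume the message format shown above.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern assumes the exact wording of the AssignmentManager message quoted above.
SKIP_ASSIGN = re.compile(
    r"Skip assigning (?P<region>\S+), it is on a dead but not processed yet "
    r"server: (?P<server>\S+)"
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_skip_assign(line):
    """Hypothetical helper: return region and server details, or None if no match."""
    match = SKIP_ASSIGN.search(line)
    if match is None:
        return None
    region = match.group("region")
    # Region names end with ".<encoded name>."; the encoded name is what the UI shows.
    encoded_name = region.rstrip(".").rsplit(".", 1)[1]
    # Server names are host,port,startcode.
    host, port, start_code = match.group("server").rsplit(",", 2)
    return {
        "region": region,
        "encoded_name": encoded_name,
        "host": host,
        "port": int(port),
        "start_code": int(start_code),
        # The start code is in milliseconds since the epoch.
        "started_at": EPOCH + timedelta(milliseconds=int(start_code)),
    }


line = (
    "master.AssignmentManager: Skip assigning "
    "table_name,13153,1485460927890.3d68e485cb6294345fe1469097fa5aca., "
    "it is on a dead but not processed yet server: server05,16020,1494493877392"
)
info = parse_skip_assign(line)
print(info["encoded_name"], info["host"], info["started_at"].isoformat())
```

If the start code printed here differs from the one the master UI shows for the live server05, the "dead" entry refers to an earlier instance of that region server rather than the one currently serving regions.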
Does anybody have any advice for resolving this? This is a production system and so shutting down the master is a last resort.
You can do this:
- Manually kill the region server process on server05 by issuing kill -9. This will cause the master to recognize that the server is dead and re-assign the regions that were hosted there.
Also, you can safely restart the master in a production environment. Nothing in the HBase client depends on the master being available in the regular read/write paths (only DDL statements do). The master is pretty light and will come up quickly, so you can restart masters safely.