Yesterday I performed a number of region merges on one of our larger tables. The table had ~1400 regions to start with, but many of them were small, and we wanted to bring the average region size closer to our region size limit of 15GB. The merges went well and left us with just over 700 regions. However, since then we have had a huge number of offline regions: nearly 600 are currently listed as offline. Does anybody know what causes these offline regions and how to fix them? I understand that regions do sometimes go offline legitimately, but this many suggests a problem to me.
Are you saying that "after the merge, 700 regions were online, then 600 of those 700 went offline for no reason"? When you merge two child regions into one bigger region, the child regions stay offline until they are cleaned up. When a region server goes down, the regions on that server are offline until they are reassigned. If neither of these two reasons applies, we need to check the master log to figure out the root cause. (Also, if you have a backup master, could you do a master failover to see whether the problem goes away, or restart the cluster?)
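To put rough numbers on the first case, here is a minimal sketch of the bookkeeping. It assumes every merge combines exactly two regions and that none of the merged-away regions has been cleaned up yet. `merge_bookkeeping` is a hypothetical helper for illustration only, not an HBase API.

```python
from typing import Tuple


def merge_bookkeeping(initial_regions: int, pairwise_merges: int) -> Tuple[int, int]:
    """Return (online_regions, offline_merged_away_regions).

    Assumption: each merge takes two online regions offline and brings
    one new merged region online, and no cleanup has happened yet.
    """
    if pairwise_merges < 0 or pairwise_merges > max(initial_regions - 1, 0):
        raise ValueError("merge count is not possible for this region count")
    online = initial_regions - pairwise_merges   # net loss of one region per merge
    offline = 2 * pairwise_merges                # both inputs of every merge go offline
    return online, offline


if __name__ == "__main__":
    online, offline = merge_bookkeeping(1400, 700)
    print(f"online={online}, offline awaiting cleanup={offline}")
```

Under those assumptions, going from about 1400 regions to about 700 could leave up to about 1400 merged-away regions listed as offline, so a figure of around 600 would not be surprising by itself if some of them have already been cleaned up.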
Thanks for the response. No, the 700 regions were still up, but there were also 600 offline regions. If these are the child regions from the merge, do you know when they would be cleaned up? We had to do a restore_snapshot because of some system instability, and now the online region count is correct but the offline region count is over 2000. Will these clear up on their own, or only after a master restart? We can't restart the cluster because this is a customer-facing system.
hbck reports one inconsistency: a single Empty REGIONINFO_QUALIFIER. I know hbck has a tool to fix this, but I haven't run it yet. That's the only inconsistency shown. So would this indicate that the offline regions are normal? Thanks.