Dead region servers

Contributor

Hello everybody, there was an electrical problem and the cluster was suddenly shut down.
After restarting everything, HBase shows all the RegionServers online (but with 0 regions each), and RegionServers with the same names are listed under Dead Region Servers.
Every time I restart HBase, new rows are added to Dead Region Servers.
This already happened to me a long time ago and the problem was related to ZooKeeper, but I can't find the old post.
Do you know what I can do? Thanks

P.S. My cluster is Kerberized, HBase version 2.0.

Super Collaborator (accepted solution)

Hello @loridigia,

It seems that, due to the outage, multiple ServerCrashProcedures were created for the RegionServers. The dead RegionServers with the same names are different instances of the same RegionServers, each with a different epoch timestamp (start code). As the HBase Master was also down, it may not have been able to process the expiration of those RegionServers. You might see some crash procedures waiting to finish under the "Procedures & Locks" section of the active HBase Master Web UI.
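A RegionServer name has the form host,port,startcode, where the start code is the server's start time in epoch milliseconds; that is why the same host and port can show up several times in the dead list. A minimal sketch for splitting such a name (the helper names are made up for this example, and rs_start_time assumes GNU date):

```shell
#!/bin/sh
# Split an HBase ServerName of the form "host,port,startcode".
# The start code is the RegionServer start time in epoch milliseconds.
rs_host()       { printf '%s\n' "$1" | cut -d, -f1; }
rs_port()       { printf '%s\n' "$1" | cut -d, -f2; }
rs_start_code() { printf '%s\n' "$1" | cut -d, -f3; }

# Human-readable start time in UTC (GNU date syntax; BSD/macOS date differs).
rs_start_time() {
  ms=$(rs_start_code "$1")
  date -u -d "@$((ms / 1000))" '+%Y-%m-%d %H:%M:%S UTC'
}
```

For example, rs_start_code 'my-server3.domain.com,16020,1658432084980' prints 1658432084980, and with GNU date rs_start_time on the same name prints 2022-07-21 19:34:44 UTC. Two dead entries for the same host should differ only in that third field.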

Since you have already solved this issue in the past by working on ZooKeeper, I guess you can try this:

1. Stop HBase.

2. Log in to ZooKeeper using hbase zkcli (with a valid hbase Kerberos ticket).

3. Delete the /hbase-secure znode: rmr /hbase-secure

4. Sideline the entries under the HDFS directory /hbase/MasterProcWALs: hdfs dfs -mv /hbase/MasterProcWALs/* /tmp (not sure if this was done earlier).

5. Start HBase.
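Steps 2 to 4 can be scripted roughly as below. This is only a sketch: it assumes HBase is already stopped, an hbase ticket is in the cache, hbase zkcli accepts a one-shot command as arguments the way zkCli.sh does (otherwise run rmr interactively as in step 2), and the paths match this thread (/hbase-secure, /hbase/MasterProcWALs). The function and variable names and the sideline directory are hypothetical; DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Sketch of steps 2-4; run only while HBase is fully stopped and after kinit as hbase.
# Defaults mirror this thread; check zookeeper.znode.parent and hbase.rootdir first.
ZNODE_PARENT=${ZNODE_PARENT:-/hbase-secure}
PROC_WAL_DIR=${PROC_WAL_DIR:-/hbase/MasterProcWALs}
# Hypothetical sideline directory: a dedicated dir is easier to restore than /tmp itself.
SIDELINE_DIR=${SIDELINE_DIR:-/tmp/MasterProcWALs.sideline}

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

reset_master_state() {
  # Step 3: remove the HBase znode tree; HBase recreates it on the next start.
  # Assumes hbase zkcli runs a one-shot command given as arguments, like zkCli.sh.
  run hbase zkcli rmr "$ZNODE_PARENT" || return 1
  # Step 4: sideline the Master procedure WALs. The glob is quoted so that the
  # HDFS shell expands it instead of the local shell.
  run hdfs dfs -mkdir -p "$SIDELINE_DIR" || return 1
  run hdfs dfs -mv "$PROC_WAL_DIR/*" "$SIDELINE_DIR"
}
```

Running it once with DRY_RUN=1 is a cheap way to check the paths before anything is deleted or moved.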

Contributor

Hi @rki_, and thanks for your answer, it was exactly what was needed.
But, if I may ask: after that, I see all RegionServers online and 0 offline, with all regions on one RegionServer except for hbase:meta, which is on another one (I have 3 in total).
The problem is that I get this error in the Master:

org.apache.hadoop.hbase.NotServingRegionException: hbase:quota,,1620896369946.28dd7c81713c9347e8dfe4e6993b1ec7. is not online on my-server3.domain.com,16020,1658432084980


Do you have any idea what I could do?

Thanks

Super Collaborator

Hello @loridigia,

You can try to assign the region from the HBase shell:

> assign '28dd7c81713c9347e8dfe4e6993b1ec7'
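The argument to assign is the encoded region name, i.e. the 32-character hex string between the last two dots of the full region name in the NotServingRegionException. A small sketch to extract it and run the assign without opening an interactive shell (the helper names are hypothetical, and it assumes the hbase shell -n non-interactive option of HBase 2.x):

```shell
#!/bin/sh
# Print the encoded name of a full region name such as
# "hbase:quota,,1620896369946.28dd7c81713c9347e8dfe4e6993b1ec7.".
# Prints nothing for old-style names without an encoded suffix (e.g. "hbase:meta,,1").
encoded_region_name() {
  printf '%s\n' "$1" | grep -oE '\.[0-9a-f]{32}\.$' | tr -d '.'
}

# Assign one region through the non-interactive HBase shell (needs a valid ticket).
assign_region() {
  enc=$(encoded_region_name "$1")
  [ -n "$enc" ] || { echo "no encoded name in: $1" >&2; return 1; }
  echo "assign '$enc'" | hbase shell -n
}
```

For the region in the error above, encoded_region_name prints 28dd7c81713c9347e8dfe4e6993b1ec7, the same value used in the assign command.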

If you can attach the output of the command below (run with a valid ticket), we can check which regions are offline or in transition.

# hbase hbck -details
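The report is easier to attach and search once it is saved to a file (hbck.report is the name used later in this thread). A sketch, assuming each inconsistency line starts with ERROR: as in the output quoted in the next reply, and that the report ends with a Status: line; the function names are made up, and the ERROR count is only a rough before/after indicator:

```shell
#!/bin/sh
# Save the full hbck report (stdout and stderr) to a file; needs a valid ticket.
save_hbck_report() {
  hbase hbck -details > "${1:-hbck.report}" 2>&1
}

# Number of lines starting with "ERROR:" in a saved report.
count_hbck_errors() {
  grep -c '^ERROR:' "${1:-hbck.report}"
}

# Last "Status:" value in a saved report (e.g. OK or INCONSISTENT), if present.
hbck_status() {
  sed -n 's/^Status: //p' "${1:-hbck.report}" | tail -n 1
}
```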

Contributor

Hi RKI, the command worked and that error is now gone... but running "hbase hbck -details" I got 560 inconsistencies, all identical:

ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
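For context, the regions of a table must cover its whole key space without gaps: the first region starts at the empty key, each region starts exactly where the previous one ends, and the last one ends at the empty key. A toy check over a hypothetical "startkey endkey" listing ("-" stands for the empty key) shows what counts as a hole; it is an illustration only, not how hbck itself is implemented:

```shell
#!/bin/sh
# Toy illustration of a region-chain hole check (not hbck's implementation).
# Input: one region per line as "startkey endkey", sorted by start key,
# with "-" standing for the empty key.
find_region_holes() {
  awk '
    BEGIN { prev = "-" }           # the chain must begin at the empty start key
    {
      if ($1 != prev) printf "hole between %s and %s\n", prev, $1
      prev = $2                    # the next region must start where this one ends
    }
    END {
      if (NR == 0) print "hole between - and -"    # no region covers the table
      else if (prev != "-") printf "hole between %s and -\n", prev
    }
  '
}
```

With no regions at all, the whole key space from the empty start key to the empty end key is a single hole.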


Super Collaborator

Hi,

A hole in the region chain most probably indicates that some regions are not online yet, and that is what creates the hole. To find them, search the saved hbck -details output (hbck.report) for undeployed regions:

# cat hbck.report | grep "not deployed on any region server"

If that command lists any regions, you will need to assign them using the HBase shell.
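A sketch of such a batch assign (not the poster's actual script). It assumes each matching line in hbck.report contains the full region name ending in .<encoded name>., as in hbck's "ERROR: Region { meta => ... }" lines; the function names are hypothetical, and the generated commands are worth reviewing before piping them into hbase shell -n:

```shell
#!/bin/sh
# Turn every "not deployed on any region server" entry into an assign command.
# The encoded name is the 32-hex-char part between two dots of the region name.
undeployed_assign_cmds() {
  grep 'not deployed on any region server' "${1:-hbck.report}" \
    | grep -oE '\.[0-9a-f]{32}\.' \
    | tr -d '.' \
    | sort -u \
    | sed "s/.*/assign '&'/"
}

# Run all generated assigns in one non-interactive shell session (needs a valid ticket).
assign_undeployed() {
  undeployed_assign_cmds "${1:-hbck.report}" | hbase shell -n
}
```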

Contributor

You are a SAVIOUR!!
I made a script to assign all regions reported as "not deployed on any region server", and now it works fine!!
Awesome, thanks a lot mate!