One region of the "prod.timelineservice.entity" HBase table, which is used by YARN Timeline Service v2.0, is corrupted

Explorer

Hi,

In our environment, one region of the HBase table "prod.timelineservice.entity" is stuck in transition. I'm seeing the following metadata error during a health scan:

Number of regions in transition: 1

2021-03-01 19:18:59,923 INFO  [main] util.HBaseFsck: Loading regionsinfo from the hbase:meta table

ERROR: Empty REGIONINFO_QUALIFIER found in hbase:meta

Would it be okay to drop this table to recover?

With earlier versions of HBase I could have fixed this easily using the hbck utility, but it looks like in HBase 2.x the "repair" options are no longer supported. As a last resort I might need to copy this table to another HBase table and then drop it to get rid of the problematic region. As this table is used by the YARN Timeline Service, I just wanted to check whether this approach would be okay.

Appreciate any response.

7 REPLIES

Explorer

Update:

Because of this RIT issue the balancer is unable to run, and getting the balancer running again is my ultimate goal.

I have tried forced “assign” and “unassign” operations for this region, but they always time out.

I can see an “exclusive” lock on this region, held by an assignment procedure that has not completed for a long time.

Also, with the newer version of HBase the hbck utility has a lot of restrictions and the majority of “-fix” operations are not supported, so I can’t use it to fix this assignment issue.

I tried to take a backup of the table so I could drop it and get rid of this region, but the snapshot also times out, I believe due to the pre-existing lock on the region.

I would like to know what my options are now. Is there any way to explicitly remove the existing lock or kill the assignment procedure and then try to assign this region manually?

Super Collaborator

Hello @Priyanka26 

 

Thanks for using Cloudera Community. Based on the synopsis, you have 1 region of the "prod.timelineservice.entity" table in transition (RIT). You have tried the "assign" and "snapshot" commands, yet both are timing out. You have also raised queries on HBCK2 usage as compared with HBCK1.

Coming to the RIT, please confirm the region state (CLOSING, FAILED_CLOSE, OPENING, FAILED_OPEN). Then review the reason for that state in the HMaster and RegionServer logs, searching by the region ID. As the YARN Timeline Service uses 1 RegionServer JVM, we only have to check the logs on 1 host. Once we confirm the reason for the RIT, we can discuss possible mitigation steps.
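
As a rough sketch of that log review, the helper below prints the log lines that mention a given encoded region name together with one of the four states listed above. The function name `region_state_lines` and the log path in the usage comment are hypothetical, and actual log locations and line formats depend on the distribution, so treat this as a starting point rather than a definitive procedure.

```shell
# Print log lines that mention the encoded region name AND one of the
# transition states named above.
# Usage: region_state_lines <encoded-region-name> <log-file>...
region_state_lines() {
  region="$1"
  shift
  # -h drops file-name prefixes; -F treats the region name as a literal string.
  grep -hF -- "$region" "$@" | grep -E 'FAILED_OPEN|FAILED_CLOSE|OPENING|CLOSING'
}

# Example (hypothetical log path; adjust for your cluster):
#   region_state_lines <encoded-region-name> /var/log/hbase/hbase-*-master-*.log | tail -n 20
```

Looking at the most recent matches first usually shows whether the region is still cycling through retries or has settled in a failed state.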

 

For HBCK, the HBCK2 tool offers the functionality of HBCK1 minus the "-fix" command, partly because HBCK2 exposes the individual "-fix" operations as separate commands, as documented in Link [1]. With newer HBase releases, most of the commands listed under the link are available. It's likely that the HBase version used by your team does not support all of the required HBCK2 commands. The link also lists the HBase versions compatible with each command.
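
As an illustration of using those individual commands for a region stuck behind a long-running procedure, here is a minimal sketch. It assumes the `bypass` and `assigns` commands and the `-o` (override) and `-r` (recursive) options as described in the HBCK2 README at Link [1]. The wrapper name `hbck2`, the `DRY_RUN` switch and the default jar path are hypothetical, and whether your HBase version supports these commands should be checked against that README first.

```shell
# Path to the HBCK2 jar built from hbase-operator-tools (assumed location).
HBCK2_JAR="${HBCK2_JAR:-./hbase-hbck2.jar}"

# Run an HBCK2 command; with DRY_RUN=1 only print the command line.
hbck2() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hbase hbck -j $HBCK2_JAR $*"
  else
    hbase hbck -j "$HBCK2_JAR" "$@"
  fi
}

# Possible sequence (placeholders in angle brackets must be replaced):
#   hbck2 bypass -o -r <procedure-id>      # release the stuck procedure
#   hbck2 assigns <encoded-region-name>    # schedule a fresh assignment
```

The procedure ID can usually be found on the HMaster UI's "Procedures & Locks" page or with `list_procedures` and `list_locks` in the hbase shell. Running with `DRY_RUN=1` first lets you review the exact command before it touches the cluster.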

 

- Smarak

 

[1] https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2

Explorer

@smdas Thank you for your response. One more question, regarding running the HBCK2 utility in a kerberized environment: I am getting the error "Failed to specify server's Kerberos principal name" even though I'm authenticated as the hbase principal.

Could you please let me know if the principal needs to be passed as an external parameter? I even tried passing the HBase configuration with the --config option, but it was not accepted as a valid option.

 

==========================================

 

[root@itk-phx-prod-edge-1 ~]# kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase

[root@itk-phx-prod-edge-1 ~]# klist

Ticket cache: FILE:/tmp/krb5cc_0

Default principal: hbase@PROD.DATALAKE.PHX

 

Valid starting       Expires              Service principal

03/18/2021 16:45:53  03/19/2021 16:45:53  krbtgt/PROD.DATALAKE.PHX@PROD.DATALAKE.PHX

 

===========================================

 

 

[root@itk-phx-prod-edge-1 target]# hbase hbck -j hbase-hbck2-1.2.0-SNAPSHOT.jar -s assigns hbase:namespace 1575575842296.0c72d4be7e562a2ec8a86c3ec830bdc5

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/root/hbase-hbck2/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.2.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/phoenix/phoenix-5.0.0.3.1.0.0-78-server.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

16:47:07.894 [main] INFO  org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient - Connect 0x560348e6 to itk-phx-prod-zk-1.datalake.phx:2181,itk-phx-prod-zk-2.datalake.phx:2181,itk-phx-prod-zk-3.datalake.phx:2181 with session timeout=90000ms, retries 6, retry interval 1000ms, keepAlive=60000ms

16:47:07.962 [ReadOnlyZKClient-itk-phx-prod-zk-1.datalake.phx:2181,itk-phx-prod-zk-2.datalake.phx:2181,itk-phx-prod-zk-3.datalake.phx:2181@0x560348e6-SendThread(itk-phx-prod-zk-2.datalake.phx:2181)] WARN  org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: Zookeeper client cannot authenticate using the Client section of the supplied JAAS configuration: '/usr/hdp/current/hbase-client/conf/hbase_regionserver_jaas.conf' because of a RuntimeException: java.lang.SecurityException: java.io.IOException: /usr/hdp/current/hbase-client/conf/hbase_regionserver_jaas.conf (No such file or directory) Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.

16:47:08.253 [main] INFO  org.apache.hbase.HBCK2 - Skipped assigns command version check; 'skip' set

16:47:08.838 [main] INFO  org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient - Close zookeeper connection 0x560348e6 to itk-phx-prod-zk-1.datalake.phx:2181,itk-phx-prod-zk-2.datalake.phx:2181,itk-phx-prod-zk-3.datalake.phx:2181

Exception in thread "main" java.io.IOException: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: java.io.IOException: Call to itk-phx-prod-master-2.datalake.phx/192.168.15.180:16000 failed on local exception: java.io.IOException: Failed to specify server's Kerberos principal name

at org.apache.hadoop.hbase.client.HBaseHbck.assigns(HBaseHbck.java:111)

at org.apache.hbase.HBCK2.assigns(HBCK2.java:308)

at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:819)

at org.apache.hbase.HBCK2.run(HBCK2.java:777)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)

at org.apache.hbase.HBCK2.main(HBCK2.java:1067)

Caused by: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: java.io.IOException: Call to itk-phx-prod-master-2.datalake.phx/192.168.15.180:16000 failed on local exception: java.io.IOException: Failed to specify server's Kerberos principal name

at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:336)

at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:95)

at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:571)

at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$BlockingStub.assigns(MasterProtos.java)

at org.apache.hadoop.hbase.client.HBaseHbck.assigns(HBaseHbck.java:106)

... 6 more

Caused by: java.io.IOException: Call to itk-phx-prod-master-2.datalake.phx/192.168.15.180:16000 failed on local exception: java.io.IOException: Failed to specify server's Kerberos principal name

at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)

 

I would really appreciate any insight into this.

Super Collaborator

Hello @Priyanka26 

 

Thanks for the update. I see you have posted the question in a new post, and I have responded to it over there. Let us know where you stand with respect to the RIT for the "prod.timelineservice.entity" table. If you have solved the issue, kindly share the steps for our fellow community users. If the issue remains, please share the details discussed in our response.
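
For readers who land on this thread with the same Kerberos error: it is commonly reported when the client-side hbase-site.xml that HBCK2 picks up does not carry the secure-cluster settings, so a quick check of that file may help. The sketch below is only that. The function names are hypothetical, the property list is an assumption about what a kerberized client typically needs, and the conf directory is taken from the paths visible in the log above. Note also that, per the HBCK2 README, `assigns` expects encoded region names rather than a table name plus region ID, which is what the second helper extracts.

```shell
# Report whether each Kerberos-related property appears in an hbase-site.xml.
# Usage: check_kerberos_props <path-to-hbase-site.xml>
check_kerberos_props() {
  site="$1"
  for prop in hbase.security.authentication \
              hbase.master.kerberos.principal \
              hbase.regionserver.kerberos.principal; do
    if grep -qF "<name>$prop</name>" "$site"; then
      echo "present: $prop"
    else
      echo "MISSING: $prop"
    fi
  done
}

# Reduce "<timestamp>.<encoded>" (or a full region name) to the encoded name.
encoded_region_name() {
  name="${1%.}"        # drop one trailing dot, if present
  echo "${name##*.}"   # keep the text after the last remaining dot
}

# Example (placeholders in angle brackets must be replaced):
#   check_kerberos_props /usr/hdp/current/hbase-client/conf/hbase-site.xml
#   hbase --config /usr/hdp/current/hbase-client/conf hbck -j <hbck2-jar> \
#     assigns "$(encoded_region_name <region-name>)"
```

In this form `--config` is an option of the `hbase` launcher script and goes before `hbck`, not after the HBCK2 jar.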

 

- Smarak

Super Collaborator

Hello @Priyanka26 

 

Do let us know where you stand with the current post. I am aware from a separate post that you had issues with your cluster, yet we wish to follow up to ensure this post isn't left unattended from our side.

 

Thanks, Smarak

Super Collaborator

Hello @Priyanka26 

 

We wish to follow up with your team concerning this post. If the issue is resolved, please mark the post as Solved and share the steps followed by your team, so that our fellow community users can learn from your experience as well.

 

Thanks, Smarak

Super Collaborator

Hello @Priyanka26 

 

We wish to follow up with your team concerning this post. If the issue is resolved, please mark the post as Solved and share the steps followed by your team, so that our fellow community users can learn from your experience as well.

 

Thanks, Smarak
