HBase replication failing and ReplicationLag in "UNKNOWN" status

Hi All,

 

Recently we have been seeing problems with HBase replication.

 

In the region server logs, I see warnings like the following:

 

2020-05-07 16:37:46,542 WARN org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint: Peer encountered RemoteException, rechecking all sinks:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 64100 actions: Operation rpcTimeout: 64100 times, servers with issues: hostname1,16020,1588832539626

at org.apache.hadoop.hbase.client.BatchErrors.makeException(BatchErrors.java:54)
at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.getErrors(AsyncRequestFutureImpl.java:1227)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:455)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:438)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:406)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:241)
at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:178)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.replicateWALEntry(RSRpcServices.java:2230)
at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28682)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:100)
at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:90)
at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:388)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:187)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:114)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 64100 actions: Operation rpcTimeout: 64100 times, servers with issues: hostname1,16020,1588832539626
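
For anyone scanning many region server logs for the same warning, a minimal Python sketch for pulling out the failed action count, the cause, and the affected server might look like this. It assumes the single-cause message shape shown in the trace above; `FAILURE_RE` and `parse_failure` are hypothetical names, not part of HBase.

```python
import re

# Summary text of a RetriesExhaustedWithDetailsException, in the
# single-cause shape seen in the warning above (an assumption; messages
# listing several causes are not handled by this sketch).
FAILURE_RE = re.compile(
    r"Failed (?P<actions>\d+) actions?: "
    r"(?P<cause>.+?): (?P<times>\d+) times?, "
    r"servers with issues: (?P<servers>.+)$"
)


def parse_failure(line):
    """Return the failure fields of one log line as a dict, or None."""
    match = FAILURE_RE.search(line.strip())
    if match is None:
        return None
    return {
        "actions": int(match.group("actions")),
        "cause": match.group("cause"),
        "times": int(match.group("times")),
        # Kept as one string: each server is itself "host,port,startcode".
        "servers": match.group("servers"),
    }


# Sample line copied from the warning above.
sample = (
    "org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: "
    "Failed 64100 actions: Operation rpcTimeout: 64100 times, "
    "servers with issues: hostname1,16020,1588832539626"
)
print(parse_failure(sample))
```

Note that the same message appears on both the exception line and the "Caused by" line of one trace, so counting matches per line would count each event twice.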

 

 

I don't see any issue with the network. From the source region server/master, I can list the peers, check the replication status, and run curl commands against the destination region server/master. All of these work fine.

 

In the source HBase master Web UI, I can see ReplicationLag=UNKNOWN and SizeOfLogQueue=56 (meaning oldWALs are not being replicated).

 

I listed the files in /hbase/oldWALs and found around 106 that have still not been replicated to the destination, some dated more than a week ago.
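
The age check on that listing can be sketched in Python as below. This is only an illustration: it assumes the input is the text output of `hdfs dfs -ls /hbase/oldWALs` (eight whitespace-separated fields per file, with date and time in the sixth and seventh), and the function name `stale_wals` and the file names in the sample are made up.

```python
from datetime import datetime, timedelta


def stale_wals(ls_lines, now, max_age_days=7):
    """Return (path, modified) pairs for files older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for line in ls_lines:
        fields = line.split()
        # File rows have 8 fields: permissions, replication, owner, group,
        # size, date, time, path. Skip the "Found N items" header and
        # directory rows (permissions starting with "d").
        if len(fields) != 8 or fields[0].startswith("d"):
            continue
        modified = datetime.strptime(f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M")
        if modified < cutoff:
            stale.append((fields[7], modified))
    return stale


# Hypothetical listing; real WAL file names look different.
listing = [
    "Found 2 items",
    "-rw-r--r--   3 hbase hbase    1048576 2020-04-28 10:15 /hbase/oldWALs/example-wal-1",
    "-rw-r--r--   3 hbase hbase       2048 2020-05-07 09:00 /hbase/oldWALs/example-wal-2",
]
for path, modified in stale_wals(listing, now=datetime(2020, 5, 7, 17, 0)):
    print(path, modified)
```

The sketch only reports file age; it does not tell whether a given WAL is actually still queued for a replication peer.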

 

 

Could anyone please share some troubleshooting steps?

 

 
