Contributor
Posts: 126
Registered: ‎10-15-2014

hbase replication halted

My HBase replication has stopped.

I'm running version 1.0.0-cdh5.4.8 (rUnknown, Thu Oct 15 08:57:42 PDT 2015).

I have two clusters in two different datacenters: one is the master, the other is the slave.

I see the following errors in the log:

2016-04-13 22:32:50,217 WARN org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of a local or network error:
java.io.IOException: Call to hadoop2-private.sjc03.infra.ripple.com/10.160.22.99:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1014, waitTime=1200001, operationTimeout=1200000 expired.
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1255)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1223)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:21783)
	at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:65)
	at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:161)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:696)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:410)
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1014, waitTime=1200001, operationTimeout=1200000 expired.
	at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1197)
	... 7 more

This in turn fills the call queue on the remote cluster, and I get:

2016-04-13 22:35:19,555 WARN org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of an error on the remote cluster:
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.RpcServer$CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small?
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:21783)
	at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:65)
	at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:161)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:696)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:410)
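
For reference, I assume I can check the setting that warning refers to on the slave regionservers with something like the following (the hostname is one of the slave RS nodes from the log; 60030 is the default regionserver info port, so adjust if yours differs):

# Dump the slave regionserver's effective config via its info server and
# look for hbase.ipc.server.max.callqueue.size:
curl -s 'http://hadoop2-private.sjc03.infra.ripple.com:60030/conf?format=json' | tr ',' '\n' | grep -i callqueue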

My peers look good, and this was working until Mar 27.

We did have an inadvertent outage, but I was able to restore all cluster services.

status 'replication'
version 1.0.0-cdh5.4.8
5 live servers
    hadoop5-private.wdc01.infra.ripple.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=1538240180, SizeOfLogQueue=2135, TimeStampsOfLastShippedOp=Sun Mar 27 04:00:42 GMT+00:00 2016, Replication Lag=1539342209
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Tue Mar 22 10:09:39 GMT+00:00 2016
    hadoop2-private.wdc01.infra.ripple.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=810222876, SizeOfLogQueue=1302, TimeStampsOfLastShippedOp=Mon Apr 04 14:31:37 GMT+00:00 2016, Replication Lag=810287122
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Mar 25 21:20:59 GMT+00:00 2016
    hadoop4-private.wdc01.infra.ripple.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=602417946, SizeOfLogQueue=190, TimeStampsOfLastShippedOp=Thu Apr 07 00:06:38 GMT+00:00 2016, Replication Lag=602983605
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Mon Apr 04 14:35:56 GMT+00:00 2016
    hadoop1-private.wdc01.infra.ripple.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=602574285, SizeOfLogQueue=183, TimeStampsOfLastShippedOp=Thu Apr 07 00:10:29 GMT+00:00 2016, Replication Lag=602753383
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Thu Apr 07 00:10:23 GMT+00:00 2016
    hadoop3-private.wdc01.infra.ripple.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=602002192, SizeOfLogQueue=1148, TimeStampsOfLastShippedOp=Thu Apr 07 00:06:52 GMT+00:00 2016, Replication Lag=602971172
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Thu Apr 07 00:06:50 GMT+00:00 2016

I can curl the quorum I set, so I don't think it's a network issue.
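
To be concrete, this is the kind of check I mean (peer ZooKeeper host and client port taken from the log above):

# From a regionserver on the master cluster, confirm the peer ZooKeeper
# quorum answers on its client port; a healthy node replies "imok":
echo ruok | nc hadoop2-private.sjc03.infra.ripple.com 2181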

What can I do to troubleshoot?

I tried to run the following:

hbase org.apache.hadoop.hbase.replication.regionserver.ReplicationSyncUp 100000

and got the following response:

16/04/13 23:37:17 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.125.122.237:50784, server: hadoop2-private.sjc03.infra.ripple.com/10.160.22.99:2181
16/04/13 23:37:17 INFO zookeeper.ClientCnxn: Session establishment complete on server hadoop2-private.sjc03.infra.ripple.com/10.160.22.99:2181, sessionid = 0x252f1a90269f5d6, negotiated timeout = 150000
16/04/13 23:37:17 INFO regionserver.ReplicationSource: Replicating de6643f5-2a36-413e-b55f-8840b26395b1 -> 06a68811-0e50-4802-a478-d199df96bf85
16/04/13 23:37:27 INFO regionserver.ReplicationSource: Closing source 1 because: Region server is closing
16/04/13 23:37:27 WARN regionserver.ReplicationSource: Interrupted while reading edits
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
        at java.util.concurrent.PriorityBlockingQueue.poll(PriorityBlockingQueue.java:553)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.getNextPath(ReplicationSource.java:489)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:308)
16/04/13 23:37:27 INFO zookeeper.ZooKeeper: Session: 0x252f1a90269f5d6 closed
16/04/13 23:37:27 INFO zookeeper.ClientCnxn: EventThread shut down
16/04/13 23:37:27 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x152f1a8ff4ef600
16/04/13 23:37:27 INFO zookeeper.ZooKeeper: Session: 0x152f1a8ff4ef600 closed
16/04/13 23:37:27 INFO zookeeper.ClientCnxn: EventThread shut down
16/04/13 23:37:31 INFO zookeeper.ZooKeeper: Session: 0x153ee0d274c3c6a closed
16/04/13 23:37:31 INFO zookeeper.ClientCnxn: EventThread shut down

I am willing to lose the queue if there is a way to flush it and reset the sync process.
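
Is something like this what I would need? My (unverified) understanding is that removing the peer abandons its queued WALs and re-adding it starts replication fresh from that point, so please correct me if that's wrong. In the hbase shell on the master cluster, with the slave's ZooKeeper quorum as a placeholder:

list_peers
remove_peer '1'
add_peer '1', '<slave-zk-quorum>:2181:/hbase'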

I can distcp the relevant data and manually load my tables to play catch-up, as long as I can flush the queue.
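
By way of example, I could also use CopyTable instead of distcp for the catch-up window; something like the sketch below, with the table name and slave quorum as placeholders and the start time set to roughly when replication stopped:

# 1459036800000 ms = 2016-03-27 00:00 UTC, about when replication stalled
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --starttime=1459036800000 \
  --peer.adr=<slave-zk-quorum>:2181:/hbase \
  my_table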

Expert Contributor
Posts: 101
Registered: ‎01-24-2014

Re: hbase replication halted

All the data still waiting to be replicated should be in /hbase/oldWALs.

The chore that cleans them up won't do so until the znode under /hbase/replication for each WAL is cleared, so while you probably aren't in danger of losing data, you are in danger of filling up HDFS.
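
A couple of quick checks, assuming default paths, to see how much space the un-replicated WALs are holding and which queues are still registered in ZooKeeper:

# Total size of archived WALs being held back by replication
hdfs dfs -du -s -h /hbase/oldWALs
# Per-regionserver replication queues (znodes) still outstanding
hbase zkcli ls /hbase/replication/rs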

Maybe check master RS to slave RS connectivity? It sounds like you've checked master RS -> peer ZK, but replication ships edits directly to the slave regionservers.
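
For example, from each regionserver on the master cluster, something like this against each slave regionserver's RPC port (60020, going by your log) would confirm the path replication actually uses:

# Plain TCP connectivity check from a master-cluster RS to a slave RS
nc -vz hadoop2-private.sjc03.infra.ripple.com 60020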

The question of how to manually sync up replication is covered here [1], along with some other information that may be helpful.

As far as I understand, cycling the regionservers on the master cluster should force them to race to grab the replication znode lock for the outstanding WALs and attempt to restart the replication process. I've never personally had to do this, so please do validate this statement for yourself before proceeding.
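
On CDH you would normally do that as a rolling restart of the HBase service from Cloudera Manager. On a plain Apache install the per-host step would look roughly like the sketch below (graceful_stop.sh moves regions off, stops the regionserver, then restarts and reloads it); again, treat this as a sketch rather than a tested procedure:

# Run from the HBase install directory, once per master-cluster regionserver
./bin/graceful_stop.sh --restart --reload hadoop1-private.wdc01.infra.ripple.com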

[1]http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_bdr_hbase_replication.html