Support Questions

sumit.nigam · ‎10-12-2015

I am trying to enable HA for Resource Mgr as well NameNode. However, very often the masters failover to standby. There is no issue with HA as such, but every failover ends up exhausting one application attempt. I notice following issues:

A series of slow fsync followed (sometimes only) by CancelledKeyException.

2015-10-12 17:22:41,000 - WARN [SyncThread:3:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:3 took 6943ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2015-10-12 17:22:41,001 - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@494] - Processed session termination for sessionid: 0x1505bcdb3e3054e

2015-10-12 17:22:41,002 - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x1505bcdb3e30003 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:null Error:KeeperErrorCode = Session moved
2015-10-12 17:22:41,004 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.65.144.35:36030 which had sessionid 0x1505bcdb3e30003
2015-10-12 17:22:41,006 - ERROR [CommitProcessor:3:NIOServerCnxn@178] - Unexpected Exception: 
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:644)
at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)

The time taken is some time as high as 10sec. This could surely timeout the clients, I suppose leading to deletion of ephemeral nodes that masters created.

Around this time, the masters switch over. I have seen that disk space is not a concern. However, at times the await time to ZK dataDir drive does show a surge.

I also confirmed that GC pauses are minimal.

Any pointers would be really appreciated.

samurai · ‎02-06-2017

Hi sumit.nigam,

I see that I'm a bit late to the party, but I found your thread while looking for a solution to a problem that I have as well.

Are you hosting zookeepers on virtual machines, or on real hardware? Is zookeeper store in a dedicated disk?

Depending on the version you are running, there are guides from zookeeper that might help:

https://zookeeper.apache.org/doc/trunk/zookeeperStarted.html

https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_maintenance

For example, if your ZK cluster has been running for a while, maybe you need to clean up some of the logs.

In either case, it always helps to know more details about the setup you are debugging. 🙂

Hope it helps (someone),

camypaj

View solution in original post

samurai · ‎02-06-2017