10-12-2015 09:27 PM
I am trying to enable HA for Resource Mgr as well NameNode. However, very often the masters failover to standby. There is no issue with HA as such, but every failover ends up exhausting one application attempt. I notice following issues:
A series of slow fsync followed (sometimes only) by CancelledKeyException.
2015-10-12 17:22:41,000 - WARN [SyncThread:3:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:3 took 6943ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide 2015-10-12 17:22:41,001 - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@494] - Processed session termination for sessionid: 0x1505bcdb3e3054e 2015-10-12 17:22:41,002 - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x1505bcdb3e30003 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:null Error:KeeperErrorCode = Session moved 2015-10-12 17:22:41,004 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.65.144.35:36030 which had sessionid 0x1505bcdb3e30003 2015-10-12 17:22:41,006 - ERROR [CommitProcessor:3:NIOServerCnxn@178] - Unexpected Exception: java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151) at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404) at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:644) at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
The time taken is some time as high as 10sec. This could surely timeout the clients, I suppose leading to deletion of ephemeral nodes that masters created.
Around this time, the masters switch over. I have seen that disk space is not a concern. However, at times the await time to ZK dataDir drive does show a surge.
I also confirmed that GC pauses are minimal.
Any pointers would be really appreciated.
02-06-2017 04:09 AM
I see that I'm a bit late to the party, but I found your thread while looking for a solution to a problem that I have as well.
Are you hosting zookeepers on virtual machines, or on real hardware? Is zookeeper store in a dedicated disk?
Depending on the version you are running, there are guides from zookeeper that might help:
For example, if your ZK cluster has been running for a while, maybe you need to clean up some of the logs.
In either case, it always helps to know more details about the setup you are debugging. :)
Hope it helps (someone),
02-06-2017 07:33 PM
@samurai - Yes, there were 2 main issues. One, was that these were VMs and another was that zookeeper was collocated with another service which shared the same disk.