Created on 10-12-2015 09:27 PM - edited 09-16-2022 02:43 AM
I am trying to enable HA for the ResourceManager as well as the NameNode. However, the masters fail over to standby very often. There is no issue with HA as such, but every failover exhausts one application attempt. I notice the following issues:
A series of slow fsync warnings, sometimes followed by a CancelledKeyException:
2015-10-12 17:22:41,000 - WARN [SyncThread:3:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:3 took 6943ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2015-10-12 17:22:41,001 - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@494] - Processed session termination for sessionid: 0x1505bcdb3e3054e
2015-10-12 17:22:41,002 - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x1505bcdb3e30003 type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:null Error:KeeperErrorCode = Session moved
2015-10-12 17:22:41,004 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.65.144.35:36030 which had sessionid 0x1505bcdb3e30003
2015-10-12 17:22:41,006 - ERROR [CommitProcessor:3:NIOServerCnxn@178] - Unexpected Exception:
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
    at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:644)
    at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
The fsync time is sometimes as high as 10 seconds. This could surely time out the clients, which I suppose leads to deletion of the ephemeral nodes that the masters created.
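For reference, these are the ZooKeeper session timeouts that I believe are in play here, shown with what I understand to be the defaults (the property names are standard for HDFS/YARN HA, but the values below are illustrative, not something I have tuned):

    <!-- core-site.xml: session timeout used by the NameNode ZKFC (default ~5000 ms) -->
    <property>
      <name>ha.zookeeper.session-timeout.ms</name>
      <value>5000</value>
    </property>

    <!-- yarn-site.xml: session timeout used by the ResourceManager (default ~10000 ms) -->
    <property>
      <name>yarn.resourcemanager.zk-timeout-ms</name>
      <value>10000</value>
    </property>

A 10-second fsync stall would exceed both of these, so session expiry and loss of the ephemeral znodes the masters hold would follow.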
Around this time, the masters switch over. I have confirmed that disk space is not a concern. However, at times the await time on the ZK dataDir drive does show a surge.
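For what it's worth, this is roughly how I have been watching that drive (just a sketch; the device name is an example for whichever disk holds the ZK dataDir):

    # extended per-device statistics every 5 seconds; watch the await column
    iostat -dx 5 sdb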
I also confirmed that GC pauses are minimal.
Any pointers would be really appreciated.
Created 02-06-2017 04:09 AM
Hi sumit.nigam,
I see that I'm a bit late to the party, but I found your thread while looking for a solution to a problem that I have as well.
Are you hosting the ZooKeepers on virtual machines or on real hardware? Is the ZooKeeper data store on a dedicated disk?
Depending on the version you are running, there are guides from zookeeper that might help:
https://zookeeper.apache.org/doc/trunk/zookeeperStarted.html
https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_maintenance
For example, if your ZK cluster has been running for a while, maybe you need to clean up old snapshots and transaction logs.
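If you are on ZooKeeper 3.4 or later, the autopurge settings in zoo.cfg can take care of this automatically; the retention values here are only an example (on older versions the bundled zkCleanup.sh script run from cron does the same job):

    # keep the 5 most recent snapshots/transaction logs, purge the rest every 24 hours
    autopurge.snapRetainCount=5
    autopurge.purgeInterval=24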
In either case, it always helps to know more details about the setup you are debugging. 🙂
Hope it helps (someone),
camypaj
Created 02-06-2017 07:33 PM
@samurai - Yes, there were two main issues. One was that these were VMs, and the other was that ZooKeeper was collocated with another service that shared the same disk.
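After moving ZooKeeper onto its own storage, the relevant zoo.cfg entries look roughly like this (the paths are just placeholders; the point is that dataLogDir sits on a dedicated disk so fsync of the transaction log is not competing with other I/O):

    # snapshots
    dataDir=/data1/zookeeper
    # transaction log on its own dedicated disk
    dataLogDir=/data2/zookeeper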
Created 05-31-2018 01:44 AM
Please let us know how it was resolved. We are facing the same exception in our ZooKeeper as well.