The region server on one particular worker node goes down very frequently. Restarting the service doesn't resolve the issue. Can anyone give me an idea of what the cause might be, or what I should look for in the logs?
The description is a little too high-level for me to help efficiently. I understand this happens on one specific Region Server, not on all of them or on a random one. A couple of things can make a Region Server go down. A usual culprit is skew: this Region Server gets a disproportionate share of the traffic, for example writes. It will then flush the memstore very often and run a lot of GCs to free memory, and if those GCs last too long it may not be able to heartbeat to ZooKeeper within the configured session timeout. ZooKeeper will then expire its session and take it out of the cluster. You can look in the Region Server logs for the memstore flushes and GC pauses. You should also see ZooKeeper timeout warnings.
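If it helps, here is a minimal Python sketch for counting those three kinds of events in a Region Server log. The patterns are assumptions on my part: the exact wording of flush, pause, and session-timeout messages varies between HBase and ZooKeeper versions, so adjust them to what your logs actually contain. The function names `classify` and `summarize` are just illustrative.

```python
import re
import sys
from collections import Counter

# Assumed patterns: log wording differs between HBase/ZooKeeper versions,
# so treat these as a starting point and adapt them to your own logs.
PATTERNS = {
    "memstore_flush": re.compile(r"memstore flush", re.IGNORECASE),
    "jvm_pause": re.compile(r"detected pause in jvm|slept \d+ms instead of", re.IGNORECASE),
    "zk_timeout": re.compile(r"session (timed out|expired)", re.IGNORECASE),
}


def classify(line):
    """Return the names of all patterns that match this log line."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]


def summarize(lines):
    """Count how many lines match each pattern."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_rs_log.py <path to the region server log>
    with open(sys.argv[1], errors="replace") as fh:
        for name, n in summarize(fh).most_common():
            print(f"{name}: {n}")
```

Running it on the log of the failing node and on the log of a healthy node should show whether the failing one flushes and pauses noticeably more often, which would point to skew.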
@nmaillard I am getting this error in the Ambari UI when I check the response link for a particular worker node. Can you let me know why this is happening and how I can get this issue resolved? I tried restarting the ZooKeeper service as well, but to no effect.