Apache Slider and HBase timeout setting

Rising Star

I use Apache Slider to launch HBase containers. Is there a setting that controls how long Slider takes to consider a region server dead? A region server can take some time to shut down even after HMaster has marked it dead, possibly because it is stuck in a GC pause. However, Slider will not launch a new container/region server until the existing region server, which is hung or already marked dead by the master, gives up its container. In that case, the wait to launch a new region server instance can be arbitrarily long. How does Slider monitor the health of a region server? Is there a way to keep it in sync with HMaster when deciding whether a region server is dead?

1 ACCEPTED SOLUTION

Expert Contributor

I think you are correct that it is just seeing if the PID still exists. It should be related to this code in the app package regionserver script: https://github.com/apache/incubator-slider/blob/develop/app-packages/hbase/package/scripts/hbase_reg...

The actual heartbeat is done in the agent Controller: https://github.com/apache/incubator-slider/blob/develop/slider-agent/src/main/python/agent/Controlle...

Also, you can specify the heartbeat.monitor.interval in the appConfig.json (in milliseconds):

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {
  },
  "global": {
    "heartbeat.monitor.interval": "60000",
    ...
  }
}
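If you would rather script the change than edit the file by hand, here is a minimal sketch in Python. The helper name `set_heartbeat_interval` is made up for illustration and is not part of Slider, and 20000 is only an example value; the sketch simply writes the property into the "global" section as a string, matching the snippet above.

```python
import json


def set_heartbeat_interval(app_config: dict, interval_ms: int) -> dict:
    """Return a copy of an appConfig dict with heartbeat.monitor.interval set.

    Hypothetical helper for illustration only; it is not a Slider API.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    updated = dict(app_config)
    global_section = dict(updated.get("global", {}))
    # The snippet above stores the value as a string, so do the same here.
    global_section["heartbeat.monitor.interval"] = str(interval_ms)
    updated["global"] = global_section
    return updated


if __name__ == "__main__":
    config = {
        "schema": "http://example.org/specification/v2.0.0",
        "metadata": {},
        "global": {},
    }
    # 20000 ms is an example value, not a recommendation.
    print(json.dumps(set_heartbeat_interval(config, 20000), indent=2))
```

Other keys already present in "global" are left untouched, and the input dict is not modified.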

6 REPLIES

Explorer

Good question @Sumit Nigam. I think there is no sync between the HMaster and the slider AppMaster w.r.t the regionserver's expiry, and you could run into the situation you described. This would be a good thing to think about... @Ram Venkatesh

Rising Star

@Devaraj Das - So, I managed to take a look at the Slider classes, and I see it uses a heartbeat mechanism. Do you know what the agent uses for the heartbeat? Is it a simple 'ps' to check whether the process is alive? The reason I ask is that if it is as simple as 'ps', I can likely add another script that watches the znode for this region server and shuts the region server down locally, which would then lead the Slider AM to relaunch another container.
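A rough sketch of what such a watcher script could look like, in Python. This is only an outline under assumptions: `znode_exists` is a hypothetical callable you would have to implement with your ZooKeeper client of choice (the znode path is not shown here), the function names are made up, and it assumes a POSIX host where `os.kill` with signal 0 only checks that the PID exists, which is roughly what a 'ps' check tells you.

```python
import os
import signal
import time
from typing import Callable, Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (similar to a 'ps' check)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_stop(znode_present: bool, process_alive: bool) -> bool:
    """Stop the region server when its znode is gone but the process lingers."""
    return process_alive and not znode_present


def send_signal(pid: int, sig: int) -> None:
    """Send a signal, ignoring the case where the process has already exited."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass


def watch(pid: int,
          znode_exists: Callable[[], bool],  # hypothetical ZooKeeper check
          poll_seconds: float = 5.0,
          grace_seconds: float = 10.0,
          max_polls: Optional[int] = None) -> str:
    """Poll until the process exits, is stopped by us, or max_polls is reached."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if not pid_alive(pid):
            return "exited"
        if should_stop(znode_exists(), process_alive=True):
            send_signal(pid, signal.SIGTERM)
            deadline = time.monotonic() + grace_seconds
            while pid_alive(pid) and time.monotonic() < deadline:
                time.sleep(0.1)
            if pid_alive(pid):
                # Escalate if the process did not exit within the grace period.
                send_signal(pid, signal.SIGKILL)
            return "stopped"
        time.sleep(poll_seconds)
    return "still-running"
```

One thing the sketch does not handle: the znode may also be absent while the region server is still starting up, so a real script would need to wait until the server has registered before it starts watching.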

I see another option for reclaiming some of these containers faster, based on a closer look at the Slider classes HeartbeatMonitor and AgentProviderService. The monitoring thread's default sleep time is 60 sec, and I see it can be controlled through the heartbeat.monitor.interval property in the AgentKey class. The logic is that if a heartbeat is missed in 2 consecutive monitoring intervals, the container is marked as DEAD. Now, my ZooKeeper session timeout is 40 sec, which means the region server is marked dead once 40 sec are over. However, the agent considers it fine until 2*60 = 120 sec. So one thing I need to do is make 2*heartbeat.monitor.interval = ZooKeeper session timeout value, i.e. an interval of 20 sec. Of course, if heartbeats are still being received, this logic can't help.
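The arithmetic above, written out as a small Python sketch. The factor of 2 is taken from my reading of the monitoring logic described above, not from Slider documentation, and the function names are made up for illustration.

```python
# Assumption from the description above: a container is marked DEAD after
# two consecutive monitoring intervals without a heartbeat.
MISSED_INTERVALS_BEFORE_DEAD = 2


def agent_detection_ms(heartbeat_monitor_interval_ms: int) -> int:
    """Time until the container is marked DEAD once heartbeats stop."""
    return MISSED_INTERVALS_BEFORE_DEAD * heartbeat_monitor_interval_ms


def interval_for_timeout_ms(zk_session_timeout_ms: int) -> int:
    """Interval that makes DEAD-marking line up with the ZooKeeper session timeout."""
    return zk_session_timeout_ms // MISSED_INTERVALS_BEFORE_DEAD


print(agent_detection_ms(60000))       # 120000 (default interval: 120 sec)
print(interval_for_timeout_ms(40000))  # 20000 (for a 40 sec session timeout)
```

This only helps when heartbeats actually stop; it does nothing for a process that is hung but still heartbeating.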

Rising Star

Another question: where can I specify a value for heartbeat.monitor.interval?

Rising Star

@Devaraj Das - Are you aware of any way I can find out which mechanism Slider uses to heartbeat the container? I am being told that it can take up to 15-20 minutes to get the container back.

Rising Star

@billie - Thank you for the info. So it is exactly as I thought. In my opinion, ps is the wrong check in the context of HBase, because even when ps comes back successfully, the region server can be dead for all practical purposes. Unfortunately, this also means my idea of reducing heartbeat.monitor.interval will not make much difference, because ps will still report the process as fine.