Apache slider and Hbase timeout setting
- Labels: Apache HBase, Apache YARN
Created on ‎04-19-2016 05:53 PM - edited ‎09-16-2022 03:14 AM
I use Apache Slider to launch HBase containers. Is there a setting that controls how long Slider waits before it considers a region server dead? A region server can take some time to shut down even after the HMaster has marked it dead, for example because it is stuck in a GC pause. However, Slider will not launch a new container (region server) until the existing region server, which is hung or already marked dead by the master, gives up its container. In that case, the wait to launch a new region server instance can be arbitrarily long. How does Slider monitor the health of a region server? Is there a way to make it agree with the HMaster on whether a region server is dead?
Created ‎04-20-2016 10:33 PM
Good question @Sumit Nigam. I think there is no sync between the HMaster and the slider AppMaster w.r.t the regionserver's expiry, and you could run into the situation you described. This would be a good thing to think about... @Ram Venkatesh
Created ‎04-22-2016 02:37 PM
@Devaraj Das - I managed to take a look at the Slider classes, and I see it uses a heartbeat mechanism. Do you know what the agent uses for the heartbeat? Is it a simple 'ps' to check whether the process is alive? I ask because, if it is as simple as 'ps', I can likely add another script that watches the znode for this region server and shuts the region server down locally when the znode disappears. That would then lead the Slider AM to relaunch another container.
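The watcher idea above could be sketched roughly as follows. This is only an illustration, not part of Slider or HBase: `watch_and_stop`, `znode_exists`, and `stop_region_server` are hypothetical names, and the two callables are assumed to wrap a real ZooKeeper client lookup and a local stop command respectively.

```python
import time
from typing import Callable, Optional


def watch_and_stop(
    znode_exists: Callable[[], bool],
    stop_region_server: Callable[[], None],
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the region server's znode and stop the local process once it is gone.

    znode_exists and stop_region_server are hypothetical hooks supplied by the
    caller (for example, a ZooKeeper client check and a local kill script).
    Returns True if the stop hook was invoked, False if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if not znode_exists():
            # The master no longer sees this region server, so free the container.
            stop_region_server()
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

Once the local process exits, a PID-based liveness check would fail, which is what should prompt the Slider AM to request a replacement container.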
Another option I see for reclaiming these containers faster comes from looking closely at the Slider classes HeartbeatMonitor and AgentProviderService. The monitoring thread's default sleep time is 60 sec, and it looks like this can be controlled through the heartbeat.monitor.interval property in the AgentKey class. The logic is that if a heartbeat is missed for 2 consecutive monitoring intervals, the container is marked as DEAD. My ZooKeeper session timeout is 40 sec, so the region server is marked dead once 40 sec have passed, but the agent considers it fine until 2*60 = 120 sec. So one thing I need to do is set 2*heartbeat.monitor.interval = ZooKeeper session timeout. Of course, if a heartbeat is still received even then, this logic can't help.
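The arithmetic above can be checked with a small sketch. The function names are made up for illustration, and the "two missed intervals" rule is taken from the description in this post rather than verified against the Slider source.

```python
def agent_dead_after_ms(monitor_interval_ms: int, missed_intervals: int = 2) -> int:
    """Time until the container is marked DEAD, assuming it takes
    `missed_intervals` consecutive monitoring intervals without a heartbeat."""
    return monitor_interval_ms * missed_intervals


def interval_matching_zk_timeout(zk_session_timeout_ms: int, missed_intervals: int = 2) -> int:
    """heartbeat.monitor.interval value for which the agent's detection time
    equals the ZooKeeper session timeout."""
    return zk_session_timeout_ms // missed_intervals


print(agent_dead_after_ms(60_000))           # 120000 ms with the default interval
print(interval_matching_zk_timeout(40_000))  # 20000 ms for a 40 sec ZooKeeper timeout
```

With a 40 sec ZooKeeper session timeout, this suggests an interval of 20000 ms, compared with the 120 sec detection time under the 60 sec default.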
Created ‎04-22-2016 02:39 PM
Another question: where can I specify a value for heartbeat.monitor.interval?
Created ‎05-02-2016 02:52 PM
@Devaraj Das - Do you know of any way I can find out which mechanism Slider uses to heartbeat the container? I am told that it can take up to 15-20 minutes to get the container back.
Created ‎05-02-2016 03:40 PM
I think you are correct that it is just seeing if the PID still exists. It should be related to this code in the app package regionserver script: https://github.com/apache/incubator-slider/blob/develop/app-packages/hbase/package/scripts/hbase_reg...
The actual heartbeat is done in the agent Controller: https://github.com/apache/incubator-slider/blob/develop/slider-agent/src/main/python/agent/Controlle...
Also, you can specify the heartbeat.monitor.interval in the appConfig.json (in milliseconds):
{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {
  },
  "global": {
    "heartbeat.monitor.interval": "60000",
    ...
  }
}
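If the appConfig.json is generated or edited by a script, the setting could be applied along these lines. This is a minimal sketch: `set_heartbeat_interval` is a hypothetical helper, and the value is written as a string only because the example above shows it quoted.

```python
import json


def set_heartbeat_interval(app_config: dict, interval_ms: int) -> dict:
    """Return a copy of an appConfig dict with heartbeat.monitor.interval set
    in its "global" section, leaving the input dict unmodified."""
    updated = dict(app_config)
    global_section = dict(updated.get("global", {}))
    global_section["heartbeat.monitor.interval"] = str(interval_ms)
    updated["global"] = global_section
    return updated


example = {
    "schema": "http://example.org/specification/v2.0.0",
    "metadata": {},
    "global": {},
}
print(json.dumps(set_heartbeat_interval(example, 20000), indent=2))
```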
Created ‎05-03-2016 04:18 AM
@billie - Thank you for the info. So it is exactly as I thought. In my opinion, a ps check is the wrong liveness test for HBase, because even when ps succeeds the region server can be dead for all practical purposes. Unfortunately, this also means my idea of reducing heartbeat.monitor.interval will not make much difference, because ps will still succeed.
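To illustrate the limitation, here is a generic POSIX-style PID existence check. It is not Slider's actual code and `pid_exists` is a hypothetical name. It reports a process as alive whenever the PID is present, whether or not the process is responsive.

```python
import os


def pid_exists(pid: int) -> bool:
    """POSIX-only: return True if a process with this PID exists.

    Signal 0 performs the existence and permission checks without delivering
    a signal, so a process stuck in a long GC pause still counts as alive.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


print(pid_exists(os.getpid()))  # the current process always exists
```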
