04-30-2016
07:13 AM
I have 3 region servers and their total size on HDFS is only ~50G. I have ulimit set to unlimited, and for the hbase user the value is also very high (32K+). I am noticing the following in my logs very often, after which I start getting HFile corruption exceptions:

2016-04-27 16:44:46,845 WARN [StoreFileOpenerThread-g-1] hdfs.DFSClient: Failed to connect to /10.45.0.51:50010 for block, add to deadNodes and continue. java.net.SocketException: Too many open files
java.net.SocketException: Too many open files
    at sun.nio.ch.Net.socket0(Native Method)

After many of these "too many open files" errors, I get a barrage of HFile corruption errors too, and HBase fails to come up:

2016-04-27 16:44:46,313 ERROR [RS_OPEN_REGION-secas01aplpd:44461-1] handler.OpenRegionHandler: Failed open of region=lm:DS_326_A_stage,\x7F\xFF\xFF\xF8,1460147940285.1a764b8679b8565c5d6d63e349212cbf., starting to roll back the global memstore size. java.io.IOException: java.io.IOException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://mycluster/MA/hbase/data/lm/DS_326_A_stage/1a764b8679b8565c5d6d63e349212cbf/e/63083720d739491eb97544e16969ffc7
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:836)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:747)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:718)

My questions are two:
1. No other process on this node reports a "too many open files" error; even the DataNode logs do not show it. Why, then, is this error reported only by the region server?
2. Would an OfflineMetaRepair followed by hbck -fixMeta and hbck -fixAssignments solve the issue?
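As a sanity check on descriptor usage, the HBase book's rule of thumb (roughly one descriptor per open StoreFile, plus sockets to DataNodes) can be sketched as below. All counts here are illustrative guesses, not measurements from this cluster:

```python
# Rough file-descriptor estimate for one region server. Each open
# StoreFile costs about one descriptor; DFSClient sockets add more.
# Every number below is a hypothetical placeholder.
regions = 1000            # regions hosted by this server (illustrative)
column_families = 1       # CFs per table (illustrative)
storefiles_per_cf = 3     # average StoreFiles per CF per region

storefile_fds = regions * column_families * storefiles_per_cf
socket_fds = int(storefile_fds * 0.5)   # crude allowance for DataNode sockets
total = storefile_fds + socket_fds
print(total)  # with these numbers: 3000 + 1500 = 4500 descriptors
```

With a few thousand regions per server (see the "Waiting on 4007 regions to close" message later in this thread), even a 32K ulimit can plausibly be exhausted.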
Labels:
- Apache HBase
04-30-2016
07:02 AM
@Laurent Edel - Thanks, I had not thought about the fact that splitting does not always create two 10G regions. I am using HBase 0.98. So, if I set ConstantSizeRegionSplitPolicy through the hbase shell, can I then assume regions will always be 10G in size?
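For what it's worth, the split policy can also be set cluster-wide in hbase-site.xml instead of per table through the shell; a minimal fragment (property names as in the HBase docs, values illustrative):

```xml
<!-- hbase-site.xml: default split policy for all tables -->
<property>
  <name>hbase.regionserver.region.split.policy</name>
  <value>org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 10G in bytes -->
  <value>10737418240</value>
</property>
```

Note that even with ConstantSizeRegionSplitPolicy, a split produces two daughters of roughly half the parent's size, so regions will range between about half the max and the full max rather than sitting at exactly 10G.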
04-30-2016
07:00 AM
@Enis - I have salted rowkeys, so I am hopeful the region servers will not hotspot.
04-29-2016
05:04 PM
1 Kudo
I notice the following lines in my region server logs:

2016-04-27 12:11:11,924 WARN [MemStoreFlusher.1] regionserver.CompactSplitThread: Total number of regions is approaching the upper limit 1000. Please consider taking a look at http://hbase.apache.org/book.html#ops.regionmgt

And also:

2016-04-27 16:31:47,799 INFO [regionserver54130] regionserver.HRegionServer: Waiting on 4007 regions to close

This is surprising because I do not have that much data. Given that the default value of hbase.hregion.max.filesize is 10G, this would imply up to ~40TB of data, which is more than all of my disks put together. Does this mean many empty regions are being created? If so, why? Is there a performance cost to carrying these empty regions around? One cost, surely, is that so many file descriptors are used up. Can I get rid of them?
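The back-of-the-envelope arithmetic above can be made explicit, using only the numbers already in this post:

```python
# If every region held a full hbase.hregion.max.filesize worth of data,
# how much data would 4007 regions imply?
max_filesize_gb = 10          # default hbase.hregion.max.filesize (10G)
regions = 4007                # from "Waiting on 4007 regions to close"
implied_tb = regions * max_filesize_gb / 1024
print(round(implied_tb, 1))   # ~39.1 TB implied, versus ~50G actually stored

actual_gb = 50                # total HBase data on HDFS (from this thread)
avg_region_mb = actual_gb * 1024 / regions
print(round(avg_region_mb, 1))  # ~12.8 MB per region on average: mostly empty
```

An average region size three orders of magnitude below the split threshold is consistent with the "many empty regions" suspicion.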
Labels:
- Apache HBase
- Apache Hive
04-22-2016
02:39 PM
Another question: where can I specify a value for heartbeat.monitor.interval?
04-22-2016
02:37 PM
@Devaraj Das - So, I managed to take a look at the Slider classes. I see Slider uses a heartbeat mechanism. Do you know what the agent uses for its heartbeat? Is it a simple 'ps' to check whether the process is alive? I ask because, if it is as simple as 'ps', I can likely add another script that watches the znode for this region server and shuts it down locally, which would then lead to the Slider AM relaunching another container.

I see another option to salvage some of these containers faster, by looking closely at the Slider classes HeartbeatMonitor and AgentProviderService. The default sleep time of the monitoring thread is 60 sec, and it can be controlled through the heartbeat.monitor.interval property in the AgentKeys class. The logic is that if 2 consecutive monitoring intervals miss a heartbeat, the container is marked DEAD. Now, my ZooKeeper timeout is 40 sec, which means the region server is marked dead once 40 sec have passed; the agent, however, considers it fine until 2 * 60 = 120 sec. So one thing I need to do is make 2 * heartbeat.monitor.interval equal to the ZooKeeper session timeout. Of course, if a heartbeat is still received in that window, this logic can't help.
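The timing mismatch described above can be sketched numerically; the values are the ones mentioned in this thread, not defaults verified against any particular Slider release:

```python
# How long each side takes to declare the region server dead,
# using the intervals discussed in this thread.
zk_session_timeout_s = 40          # ZooKeeper session timeout: the master's view
heartbeat_monitor_interval_s = 60  # Slider heartbeat.monitor.interval (default cited above)
missed_intervals_for_dead = 2      # Slider marks a container DEAD after 2 misses

master_detects_s = zk_session_timeout_s
slider_detects_s = missed_intervals_for_dead * heartbeat_monitor_interval_s
print(slider_detects_s - master_detects_s)  # 80s window where the views disagree

# Aligning them as proposed: 2 * interval == ZooKeeper session timeout
aligned_interval_s = zk_session_timeout_s / missed_intervals_for_dead
print(aligned_interval_s)  # 20.0 -> set heartbeat.monitor.interval to ~20s
```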
04-19-2016
05:53 PM
I use Apache Slider for launching HBase containers. Is there a setting that controls how long it takes for Slider to consider a region server dead? A region server can take some time to shut down even after HMaster marks it dead, for example because of a GC pause it is dealing with. However, Slider will not launch a new container/region server until the existing region server, which is hung and already marked dead by the master, gives up its container. In such a case, the wait to launch a new region server instance can be arbitrarily long. How does Slider monitor the health of a region server? Is there a way to make it sync with HMaster in deciding whether a region server is dead?
Labels:
- Labels:
-
Apache HBase
-
Apache YARN
04-14-2016
04:10 PM
Ok, I was not aware that major compaction would invalidate the block cache. Not sure why that should be so, though. Is there a link where I can read more on this?
04-14-2016
04:09 PM
Yes, I agree with your point on skip-scan. We always use the leading columns in the WHERE clause.
04-13-2016
09:00 AM
So, one of the things we tried was to increase eden space. Ideally, the block cache would remain in tenured space while the memstore mostly does not get promoted, since a memstore flush will push that data out of the heap anyway. Increasing eden seems a good choice because it reduced a lot of our GC pauses. We also tried the G1 collector, but despite hearing so many good things about it, we could not tune it enough to help us with HBase. In our case, writes happen both in bursts and at a roughly constant rate, and reads usually span many regions because of our salted rowkeys. I could not understand your point about compactions, though. Would more compactions lead to longer pauses?
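The eden-size tuning described above is usually expressed in hbase-env.sh; a hedged sketch follows. The heap and eden sizes are illustrative placeholders, not recommendations, and the CMS flags assume the stock CMS collector rather than G1:

```
# hbase-env.sh fragment (sketch): enlarge eden so short-lived memstore
# allocations die young, while the block cache tenures. Sizes are
# illustrative only; tune against your own GC logs.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms8g -Xmx8g \
  -Xmn2g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```

Fixing the CMS initiating occupancy (rather than letting the JVM adapt it) makes old-generation collections start predictably before the heap fills, which tends to matter when the block cache keeps tenured space persistently large.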