04-30-2016
07:13 AM
I have 3 region servers and their total size on HDFS is only ~50G. I have ulimit set to unlimited, and for the hbase user the value is also very high (32K+). I am noticing the following in my logs very often, after which I start getting HFile corruption exceptions:

2016-04-27 16:44:46,845 WARN [StoreFileOpenerThread-g-1] hdfs.DFSClient: Failed to connect to /10.45.0.51:50010 for block, add to deadNodes and continue. java.net.SocketException: Too many open files
java.net.SocketException: Too many open files
    at sun.nio.ch.Net.socket0(Native Method)

After many of these "too many open files" errors, I get a barrage of HFile corruption errors too, and HBase fails to come up:

2016-04-27 16:44:46,313 ERROR [RS_OPEN_REGION-secas01aplpd:44461-1] handler.OpenRegionHandler: Failed open of region=lm:DS_326_A_stage,\x7F\xFF\xFF\xF8,1460147940285.1a764b8679b8565c5d6d63e349212cbf., starting to roll back the global memstore size. java.io.IOException: java.io.IOException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://mycluster/MA/hbase/data/lm/DS_326_A_stage/1a764b8679b8565c5d6d63e349212cbf/e/63083720d739491eb97544e16969ffc7
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:836)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:747)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:718)

My questions are two:
1. No other process on this node reports a "too many open files" error; even the DataNode logs do not show it. Why, then, is this error reported only by the region server?
2. Would an OfflineMetaRepair followed by hbck -fixMeta and hbck -fixAssignments solve the issue?
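As a sanity check on descriptor usage, the HBase book's rule of thumb (roughly one descriptor per open StoreFile, plus sockets to DataNodes) can be sketched as below. All counts here are illustrative guesses, not measurements from this cluster:

```python
# Rough file-descriptor estimate for one region server. Each open
# StoreFile costs about one descriptor; DFSClient sockets add more.
# Every number below is a hypothetical placeholder.
regions = 1000            # regions hosted by this server (illustrative)
column_families = 1       # CFs per table (illustrative)
storefiles_per_cf = 3     # average StoreFiles per CF per region

storefile_fds = regions * column_families * storefiles_per_cf
socket_fds = int(storefile_fds * 0.5)   # crude allowance for DataNode sockets
total = storefile_fds + socket_fds
print(total)  # with these numbers: 3000 + 1500 = 4500 descriptors
```

With a few thousand regions per server (see the "Waiting on 4007 regions to close" message later in this thread), even a 32K ulimit can plausibly be exhausted.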
Labels:
- Apache HBase
04-30-2016
07:02 AM
@Laurent Edel - Thanks, I had not thought about the fact that splitting does not always create two 10G regions. I am using HBase 0.98. So, if I set ConstantSizeRegionSplitPolicy through the hbase shell, can I then assume regions will always be 10G in size?
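For what it's worth, the split policy can also be set cluster-wide in hbase-site.xml instead of per table through the shell; a minimal fragment (property names as in the HBase docs, values illustrative):

```xml
<!-- hbase-site.xml: default split policy for all tables -->
<property>
  <name>hbase.regionserver.region.split.policy</name>
  <value>org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 10G in bytes -->
  <value>10737418240</value>
</property>
```

Note that even with ConstantSizeRegionSplitPolicy, a split produces two daughters of roughly half the parent's size, so regions will range between about half the max and the full max rather than sitting at exactly 10G.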
04-30-2016
07:00 AM
@Enis - I have salted rowkeys, so I am hopeful the region servers will not hotspot.
04-29-2016
05:04 PM
1 Kudo
I notice the following lines in my region server logs:

2016-04-27 12:11:11,924 WARN [MemStoreFlusher.1] regionserver.CompactSplitThread: Total number of regions is approaching the upper limit 1000. Please consider taking a look at http://hbase.apache.org/book.html#ops.regionmgt

And also:

2016-04-27 16:31:47,799 INFO [regionserver54130] regionserver.HRegionServer: Waiting on 4007 regions to close

This is surprising because I do not have that much data. Given that the default value of hbase.hregion.max.filesize is 10G, this would imply up to ~40TB of data, which is more than all of my disks put together. Does this mean many empty regions are being created? If so, why? Is there a performance cost to carrying these empty regions around? One cost, surely, is that so many file descriptors are used up. Can I get rid of them?
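The back-of-the-envelope arithmetic above can be made explicit, using only the numbers already in this post:

```python
# If every region held a full hbase.hregion.max.filesize worth of data,
# how much data would 4007 regions imply?
max_filesize_gb = 10          # default hbase.hregion.max.filesize (10G)
regions = 4007                # from "Waiting on 4007 regions to close"
implied_tb = regions * max_filesize_gb / 1024
print(round(implied_tb, 1))   # ~39.1 TB implied, versus ~50G actually stored

actual_gb = 50                # total HBase data on HDFS (from this thread)
avg_region_mb = actual_gb * 1024 / regions
print(round(avg_region_mb, 1))  # ~12.8 MB per region on average: mostly empty
```

An average region size three orders of magnitude below the split threshold is consistent with the "many empty regions" suspicion.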
Labels:
- Apache HBase
- Apache Hive
04-22-2016
02:39 PM
Another question: where can I specify a value for heartbeat.monitor.interval?
04-22-2016
02:37 PM
@Devaraj Das - So, I managed to take a look at the Slider classes. I see Slider uses a heartbeat mechanism. Do you know what the agent uses for its heartbeat? Is it a simple 'ps' to check whether the process is alive? I ask because, if it is as simple as 'ps', I can likely add another script that watches the znode for this region server and shuts it down locally, which would then lead to the Slider AM relaunching another container.

I see another option to salvage some of these containers faster, by looking closely at the Slider classes HeartbeatMonitor and AgentProviderService. The default sleep time of the monitoring thread is 60 sec, and it can be controlled through the heartbeat.monitor.interval property in the AgentKeys class. The logic is that if 2 consecutive monitoring intervals miss a heartbeat, the container is marked DEAD. Now, my ZooKeeper timeout is 40 sec, which means the region server is marked dead once 40 sec have passed; the agent, however, considers it fine until 2 * 60 = 120 sec. So one thing I need to do is make 2 * heartbeat.monitor.interval equal to the ZooKeeper session timeout. Of course, if a heartbeat is still received in that window, this logic can't help.
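The timing mismatch described above can be sketched numerically; the values are the ones mentioned in this thread, not defaults verified against any particular Slider release:

```python
# How long each side takes to declare the region server dead,
# using the intervals discussed in this thread.
zk_session_timeout_s = 40          # ZooKeeper session timeout: the master's view
heartbeat_monitor_interval_s = 60  # Slider heartbeat.monitor.interval (default cited above)
missed_intervals_for_dead = 2      # Slider marks a container DEAD after 2 misses

master_detects_s = zk_session_timeout_s
slider_detects_s = missed_intervals_for_dead * heartbeat_monitor_interval_s
print(slider_detects_s - master_detects_s)  # 80s window where the views disagree

# Aligning them as proposed: 2 * interval == ZooKeeper session timeout
aligned_interval_s = zk_session_timeout_s / missed_intervals_for_dead
print(aligned_interval_s)  # 20.0 -> set heartbeat.monitor.interval to ~20s
```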
04-19-2016
05:53 PM
I use Apache Slider for launching HBase containers. Is there a setting that controls how long it takes for Slider to consider a region server dead? A region server can take some time to shut down even after HMaster marks it dead, for example because of a GC pause it is dealing with. However, Slider will not launch a new container/region server until the existing region server, which is hung and already marked dead by the master, gives up its container. In such a case, the wait to launch a new region server instance can be arbitrarily long. How does Slider monitor the health of a region server? Is there a way to make it sync with HMaster in deciding whether a region server is dead?
Labels:
- Labels:
-
Apache HBase
-
Apache YARN
04-14-2016
04:10 PM
Ok, I was not aware that major compaction would invalidate the block cache. Not sure why that should be so, though. Is there a link where I can read more on this?
04-14-2016
04:09 PM
Yes, I agree with your point on skip-scan. We always use the leading columns in the WHERE clause.
04-13-2016
09:00 AM
So, one of the things we tried was to increase eden space. Ideally, the block cache would remain in tenured space while the memstore mostly does not get promoted, since a memstore flush will push that data out of the heap anyway. Increasing eden seems a good choice because it reduced a lot of our GC pauses. We also tried the G1 collector, but despite hearing so many good things about it, we could not tune it enough to help us with HBase. In our case, writes happen both in bursts and at a roughly constant rate, and reads usually span many regions because of our salted rowkeys. I could not understand your point about compactions, though. Would more compactions lead to longer pauses?
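The eden-size tuning described above is usually expressed in hbase-env.sh; a hedged sketch follows. The heap and eden sizes are illustrative placeholders, not recommendations, and the CMS flags assume the stock CMS collector rather than G1:

```
# hbase-env.sh fragment (sketch): enlarge eden so short-lived memstore
# allocations die young, while the block cache tenures. Sizes are
# illustrative only; tune against your own GC logs.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms8g -Xmx8g \
  -Xmn2g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```

Fixing the CMS initiating occupancy (rather than letting the JVM adapt it) makes old-generation collections start predictably before the heap fills, which tends to matter when the block cache keeps tenured space persistently large.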