Created 04-30-2016 07:13 AM
I have 3 Region Servers and their total size on HDFS is only ~50G. I have ulimit set to unlimited, and for the hbase user the value is also very high (32K+). I am noticing the following in my logs very often, after which I start getting HFile corruption exceptions:
2016-04-27 16:44:46,845 WARN [StoreFileOpenerThread-g-1] hdfs.DFSClient: Failed to connect to /10.45.0.51:50010 for block, add to deadNodes and continue. java.net.SocketException: Too many open files
java.net.SocketException: Too many open files
        at sun.nio.ch.Net.socket0(Native Method)
After many of these open-file errors, I get a barrage of HFile corruption errors too, and HBase fails to come up:
2016-04-27 16:44:46,313 ERROR [RS_OPEN_REGION-secas01aplpd:44461-1] handler.OpenRegionHandler: Failed open of region=lm:DS_326_A_stage,\x7F\xFF\xFF\xF8,1460147940285.1a764b8679b8565c5d6d63e349212cbf., starting to roll back the global memstore size.
java.io.IOException: java.io.IOException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://mycluster/MA/hbase/data/lm/DS_326_A_stage/1a764b8679b8565c5d6d63e349212cbf/e/63083720d739491eb97544e16969ffc7
        at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:836)
        at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:747)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:718)
I have two questions:
1. No other process on this node reports a "too many open files" error. Even the DataNode does not seem to show this error in its logs, so I am not sure why it is being reported here.
2. Would an OfflineMetaRepair followed by hbck -fixMeta and hbck -fixAssignments solve the issue?
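For reference, the repair sequence I have in mind (assuming the standard HBase 1.x tooling; the offline repair is run while HBase is stopped) would be roughly:
# offline rebuild of hbase:meta (HBase must be down):
hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
# then, after starting HBase again:
hbase hbck -fixMeta
hbase hbck -fixAssignments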
Created 05-02-2016 10:12 AM
Depending on your OS, the setting might be different than you expect. To check the actual value, become root, switch to the user hbase, and print the effective limits:
# on Hbase Region Server:
sudo -i su hbase
# print limits for the user hbase:
ulimit -a
On our RedHat 6 system, a file 90-nproc.conf was deployed in /etc/security/limits.d/. It limits the number of processes per user to 1024. The user ambari picked up these limits, and when HBase is started from Ambari the limits are somehow passed on to the HBase processes.
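If that is what is biting you, one option is to drop an override with a higher limit into /etc/security/limits.d/. This is only a sketch; the file name and the values 32768/16000 are examples, adjust them to your environment:
# /etc/security/limits.d/99-hbase.conf (hypothetical file name)
# max open file descriptors and max processes/threads for the hbase user:
hbase  -  nofile  32768
hbase  -  nproc   16000
Keep in mind that processes started through Ambari may only pick up new limits after the Ambari agent has been restarted (or after the limits are set through Ambari itself).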
As @rmaruthiyodan mentions, you can check the limits of the running processes:
grep 'open files' /proc/<Ambari Agent PID>/limits
grep 'open files' /proc/<Region Server PID>/limits
The HBase book's configuration section suggests: 'Set it to north of 10k'.
Created 04-30-2016 09:24 AM
Hello Sumit,
If your ulimit is already set to unlimited or to a very high number, you can get insight into the actual number of open files with lsof | wc -l. You may also need to increase the maximum number of file handles in the OS; check fs.file-max to see if raising it helps. This is to address the cause.
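Roughly like this (the value below is only an example, not a recommendation):
# current system-wide limit and current usage:
cat /proc/sys/fs/file-max
cat /proc/sys/fs/file-nr
# raise it on the running system:
sysctl -w fs.file-max=1000000
# make it persistent by adding the same setting to /etc/sysctl.conf:
# fs.file-max = 1000000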
An OfflineMetaRepair plus hbck -fixMeta should help with the consequence.
Created 04-30-2016 12:51 PM
@nmaillard - Thanks. Yes, I am aware of lsof and was planning to use it. Also, could there be a setting in HBase itself that restricts the number of open file handles and throws this error?
Also, did you mean /proc/sys/fs/file-max?
Thanks
Created 05-02-2016 09:21 AM
Hi Sumit,
You may also want to verify that the ulimit that is set is actually applied to the process:
# cat /proc/<Region Server PID>/limits
It is possible that somehow the user limits are overridden when the process starts up.
Created 05-02-2016 02:44 PM
Hey @rmaruthiyodan - Thanks. Yes, I had to use /proc to find the Region Server PID-specific limits. Basically, Ambari restricts this number to 32K by default, and this can be overridden in the blueprint being submitted.
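For anyone hitting the same thing, the override in the blueprint looks roughly like this. The property names hbase_user_nofile_limit and hbase_user_nproc_limit are the hbase-env ones from the HDP stack I used and may differ in other stack versions, and the values are only examples:
{
  "configurations" : [
    {
      "hbase-env" : {
        "properties" : {
          "hbase_user_nofile_limit" : "131072",
          "hbase_user_nproc_limit" : "16000"
        }
      }
    }
  ]
}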