
Too many open files in region server logs

Rising Star

I have 3 region servers and their total size on HDFS is only ~50 GB. I have ulimit set to unlimited, and for the hbase user the value is also very high (32K+). I am noticing the following in my logs very often, after which I start getting HFile corruption exceptions:

2016-04-27 16:44:46,845 WARN [StoreFileOpenerThread-g-1] hdfs.DFSClient: Failed to connect to /10.45.0.51:50010 for block, add to deadNodes and continue. java.net.SocketException: Too many open files java.net.SocketException: Too many open files at sun.nio.ch.Net.socket0(Native Method)

After many of these open-files errors, I get a barrage of HFile corruption errors too, and HBase fails to come up:

2016-04-27 16:44:46,313 ERROR [RS_OPEN_REGION-secas01aplpd:44461-1] handler.OpenRegionHandler: Failed open of region=lm:DS_326_A_stage,\x7F\xFF\xFF\xF8,1460147940285.1a764b8679b8565c5d6d63e349212cbf., starting to roll back the global memstore size.

java.io.IOException: java.io.IOException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://mycluster/MA/hbase/data/lm/DS_326_A_stage/1a764b8679b8565c5d6d63e349212cbf/e/63083720d739491eb97544e16969ffc7

at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:836) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:747) at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:718)

I have two questions:

1. No other process on this node shows a "too many open files" issue. Even the DataNode does not seem to show this error in its logs, so I am not sure why this error should be reported here.

2. Would an OfflineMetaRepair followed by hbck -fixMeta and hbck -fixAssignments solve the issue? (See the sketch below.)
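For reference, this is the sequence I have in mind (just a sketch of what I would run; I understand OfflineMetaRepair must be run with HBase stopped, and the exact hbck options depend on the HBase version):

# with HBase stopped: rebuild hbase:meta from the region info on HDFS
hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair

# after restarting HBase: repair meta entries and region assignments
hbase hbck -fixMeta
hbase hbck -fixAssignments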

1 ACCEPTED SOLUTION

Expert Contributor

Depending on your OS, the setting might be different than you expect. To check the actual value, become root, switch to the hbase user, and print the actual limits:

# on the HBase region server:
sudo -i
su hbase

# print limits for the user hbase:
ulimit -a

On our RedHat 6 system, a file 90-nproc.conf was deployed in /etc/security/limits.d/. It limits the number of processes for users to 1024. The ambari user received these limits, and when HBase is started from Ambari the limits are somehow passed on to the HBase processes.
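If the same file is the culprit on your system, one way around it is to give the service users their own limits file (a sketch only; the file name and values are examples, adjust them to your environment, and re-login or restart the services so they take effect):

# /etc/security/limits.d/91-hbase.conf (example values)
hbase   -   nofile   32768
hbase   -   nproc    16384
ambari  -   nofile   32768
ambari  -   nproc    16384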

As @rmaruthiyodan mentions, you can check the limits of the running processes:

grep 'open files' /proc/<Ambari Agent PID>/limits
grep 'open files' /proc/<Region Server PID>/limits

The HBase reference guide's configuration section suggests: 'Set it to north of 10k'.


5 REPLIES


Hello Sumit,

If your ulimit is already set to unlimited or a very high number, you can get insight into the actual number of open files with lsof | wc -l. You may also need to increase the maximum number of file handles in the OS; check fs.file-max to see if this helps. This addresses the cause.
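For example (just a sketch; the value is only an example, size it for your workload, and remember to persist the sysctl change so it survives a reboot):

# system-wide file handle limit and current usage
cat /proc/sys/fs/file-max
cat /proc/sys/fs/file-nr

# raise it at runtime (example value)
sysctl -w fs.file-max=500000

# to persist, add "fs.file-max = 500000" to /etc/sysctl.conf, then reload
sysctl -p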

An OfflineMetaRepair and an hbck -fixMeta should help with the consequence.

Rising Star

@nmaillard - Thanks. Yes, I am aware of lsof and was planning to use it. Also, could there be a setting within HBase itself that restricts the number of open file handles and throws this error?

Also, did you mean /proc/sys/fs/file-max?

Thanks

Expert Contributor

Hi Sumit,

You may also want to verify that the ulimit that is set is actually applied to the process:

# cat /proc/<Region Server PID>/limits

It is possible that somehow the user limits are overridden when the process starts up.
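For example, to find the region server PID and check the limit that is actually in effect (a sketch; it assumes pgrep is available and a single HRegionServer process is running on the node):

# find the region server PID
pgrep -f HRegionServer

# check the open files limit applied to that process
grep 'open files' /proc/$(pgrep -f HRegionServer)/limits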

Rising Star

Hey @rmaruthiyodan - Thanks. Yes, I had to use /proc to find the limits applied to the region server PID. Basically, Ambari restricts this number to 32K by default, and it can be overridden in the blueprint being submitted.
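For anyone else hitting this, the blueprint override looks roughly like the following (a sketch only; the property names hbase_user_nofile_limit and hbase_user_nproc_limit are what the hbase-env configuration in our stack exposes, and they may differ in other Ambari/stack versions):

"configurations" : [
  {
    "hbase-env" : {
      "properties" : {
        "hbase_user_nofile_limit" : "65536",
        "hbase_user_nproc_limit" : "32768"
      }
    }
  }
]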
