
Open File Descriptors warning in Cloudera Manager

Explorer

We are facing issues with open file descriptors on the TaskTrackers in Cloudera Manager, and most of the TaskTrackers are in critical health.

 

Cloudera Manager: ver 4.7

 

Please find below the status of one such TaskTracker:

 

"Open file descriptors: 24,241. File descriptor limit: 32,768. Percentage in use: 73.98%. Critical threshold: 70.00%."

We have checked the file descriptor limits at the OS level using the command "ulimit -a", and most of them show "unlimited":

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 118683
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
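(Side note: the ulimit values of an interactive shell do not necessarily match those of the already-running TaskTracker daemon; the daemon's own limits and current descriptor count can be read from /proc. A minimal sketch, assuming the TaskTracker JVM shows "tasktracker" somewhere in its ps command line:)

# Effective limits and current descriptor count of the running TaskTracker (pid lookup is an assumption)
pid=$(ps aux | grep -i '[t]asktracker' | grep java | awk '{print $2}' | head -1)
grep 'Max open files' /proc/$pid/limits
ls /proc/$pid/fd | wc -l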

 

In the Cloudera documentation, we see the following description for TaskTracker File Descriptors:

"The health check thresholds of the number of file descriptors used. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support."


We found a similar issue in the issue tracker, but we are not sure whether it refers to the same underlying problem:

https://issues.cloudera.org/browse/DISTRO-530

 

Is this indeed a bug in the underlying framework, or is there a way to increase the open file descriptor limit through Cloudera Manager? Any other approach to recover from this issue would also be appreciated.

 

Thanks

1 ACCEPTED SOLUTION

Expert Contributor

Hi Ashok, to increase the file descriptors for the TaskTracker on a CM-controlled cluster you can do the following. Keep in mind that the alert is only a warning.

 

Verify the current file descriptor ulimit value for your TaskTracker processes:

 

pid=$(ps faux | grep java | grep tasktracker | awk -F " " '{print $2}');cat /proc/$pid/limits;


Perform the following on each node in the cluster. A ulimit of 65536 would probably be enough for most use cases; 262144 below is just an extreme example. (A sketch of these edits follows the list.)

1. Edit /usr/sbin/cmf-agent and change the ulimit -n setting.
2. Edit /usr/lib64/cmf/service/mapreduce/mapreduce.sh and add

ulimit -n 262144

at the top of the script.
3. Push these script changes out to all the slave nodes.
4. A few nodes at a time, within the same rack, shut down the DataNode and TaskTracker services on them.
5. On the stopped nodes run: service cloudera-scm-agent hard_restart
6. Restart the DataNode and TaskTracker.
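A minimal sketch of the edits in steps 1, 2 and 5, assuming GNU sed and that /usr/sbin/cmf-agent contains a single "ulimit -n" line (adjust the value and paths to your environment; stop and start the DataNode and TaskTracker roles from CM as in steps 4 and 6):

# Step 1: raise the hard-coded ulimit -n value in the agent wrapper (sed pattern is an assumption)
sed -i.bak 's/ulimit -n [0-9]*/ulimit -n 262144/' /usr/sbin/cmf-agent
# Step 2: have the MapReduce launcher raise its own limit before starting the JVMs
sed -i.bak '1a ulimit -n 262144' /usr/lib64/cmf/service/mapreduce/mapreduce.sh
# Step 5: hard-restart the agent so newly launched child processes inherit the higher limit
service cloudera-scm-agent hard_restart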

 

Verify the file descriptor ulimit value for your TaskTracker processes after making the change:

 

pid=$(ps faux | grep java | grep tasktracker | awk -F " " '{print $2}');cat /proc/$pid/limits;
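If you only want the relevant row, something like this narrows it down (same pid lookup as above, taking the first match):

pid=$(ps faux | grep java | grep tasktracker | awk '{print $2}' | head -1)
grep 'Max open files' /proc/$pid/limits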

 

-roland

 


6 REPLIES


Explorer
Thank you, rteague.


I am also experiencing large numbers of open connections on my DataNodes.

Using lsof, I see ~23,000 open TCP connections to the DataNode in CLOSE_WAIT state.

Restarting the cluster brings the open file descriptors back down to a sane number.
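For reference, a rough way to tally CLOSE_WAIT sockets per owning process (the awk field positions assume standard Linux netstat -anp output):

# Count CLOSE_WAIT sockets grouped by PID/program name (7th field of netstat -anp)
sudo netstat -anp | awk '$6 == "CLOSE_WAIT" { n[$7]++ } END { for (p in n) print n[p], p }' | sort -rn | head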

I believe this is being caused by https://issues.apache.org/jira/browse/HDFS-5671

 

Is Cloudera aware of this? Any timeline on when it will be patched into CDH?

 

 

 

Guru

What process is responsible for the open sockets? Is it a local JVM, like an HBase RegionServer, or some remote network IP? Can you paste a few example lines from the following output?

 

sudo netstat -anp | grep CLOSE_WAIT

 

Also, what version of CDH are you on? There can be multiple causes for an issue like the one you're seeing, and bugs have been fixed for this in the past, with workarounds as well.

 

Finally, can you give us the "$JAVA_HOME/bin/java -version" output? There was a JVM bug in the ConcurrentMarkSweep GC in versions below 1.6.0_31; the workaround is to add the following JVM property to the runtime settings of the various Hadoop daemons:

 

-XX:-CMSConcurrentMTEnabled

Newer versions of CM add that for you automatically.
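For reference, on a cluster where the daemon options are maintained by hand rather than through CM, one place the flag could go is hadoop-env.sh; a minimal sketch, assuming the standard HADOOP_*_OPTS variables (on a CM-managed cluster, add it to the role's Java configuration options instead):

# hadoop-env.sh: append the CMS workaround flag to the DataNode and TaskTracker JVM options (illustrative sketch)
export HADOOP_DATANODE_OPTS="-XX:-CMSConcurrentMTEnabled ${HADOOP_DATANODE_OPTS}"
export HADOOP_TASKTRACKER_OPTS="-XX:-CMSConcurrentMTEnabled ${HADOOP_TASKTRACKER_OPTS}"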

 

 

Contributor

Hi... I am facing the same issue on my dev cluster.

It has CM 5.7 and CDH 5.6.

It also has Java 1.8.

The number of open file descriptors just keeps increasing. After a machine restart it comes back down to a saner number.

Note: the only difference between this cluster and the other clusters is that it also has Kafka installed. Could that be a reason?

Any idea why this is happening?

 

I ran the following on the dev nodes:

[root@hdp-dev01 ~]# /usr/sbin/lsof | sort -k 2 -n | awk ' { num [$1]++ } END { for (a in num) { print a,num[a] } }' | sort -k 2 -rn | head -n 20
cmf-agent 22642

 

The output of netstat -anp | grep CLOSE_WAIT is below:

tcp 1430 0 10.12.15.114:34176 10.12.15.114:7180 CLOSE_WAIT 16602/python2.6
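Since cmf-agent holds most of the descriptors and the CLOSE_WAIT sockets point at the Cloudera Manager port (7180), a rough way to break the agent's descriptors down further is below (the pgrep pattern is an assumption about the agent's process name):

# Open descriptors of the CM agent, grouped by type (REG, IPv4, sock, ...)
agent_pid=$(pgrep -f cmf-agent | head -1)
lsof -p "$agent_pid" | awk 'NR > 1 { t[$5]++ } END { for (k in t) print t[k], k }' | sort -rn
# How many of the agent's TCP sockets are stuck in CLOSE_WAIT
sudo netstat -anp | awk -v pid="$agent_pid" '/^tcp/ && $6 == "CLOSE_WAIT" && $7 ~ "^"pid"/"' | wc -l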

Abhishek

Contributor

Resurrecting this topic with some clarity on the issue and its remedy.

 

If RegionServers keep dead connections to the DataNodes, the same symptoms are seen: many connections in CLOSE_WAIT and a steadily increasing file descriptor count.

In extreme cases the limit can be reached, which causes the host to fail with a "no more open file descriptors" error.
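A quick way to check whether a RegionServer is the process holding such connections is to count its CLOSE_WAIT sockets toward the DataNode transfer port (50010 is the CDH 5 default; finding the RegionServer by its HRegionServer main class is an assumption):

# Count a RegionServer's CLOSE_WAIT sockets toward the local DataNode port (default 50010)
rs_pid=$(pgrep -f HRegionServer | head -1)
sudo netstat -anp | awk -v pid="$rs_pid" '/^tcp/ && $6 == "CLOSE_WAIT" && $5 ~ /:50010$/ && $7 ~ "^"pid"/"' | wc -l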

 

There was a bug in HBase prior to CDH 5.13, described in more detail in the upstream JIRA [1]:

HBASE-9393: "Hbase does not closing a closed socket resulting in many CLOSE_WAIT"

 

This issue was patched in the following CDH releases:

CDH 5.13.0, CDH 5.13.1, CDH 5.13.2, CDH 5.13.3,
CDH 5.14.0, CDH 5.14.2, CDH 5.14.4,
CDH 5.15.0, CDH 5.15.1,
CDH 6.0.0.

 

[1] - upstream HBase JIRA - https://issues.apache.org/jira/browse/HBASE-9393?attachmentOrder=asc