Created on 01-21-2014 04:31 AM - edited 09-16-2022 01:52 AM
We are facing issues with open file descriptors on the TaskTrackers reported by Cloudera Manager, and most of the TaskTrackers are in critical health.
Cloudera Manager: ver 4.7
Please find below the status of one such TaskTracker:
"Open file descriptors: 24,241. File descriptor limit: 32,768. Percentage in use: 73.98%. Critical threshold: 70.00%."
We have checked the file descriptor limits at the OS level using the command "ulimit -a", and it shows "unlimited" for most limits:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 118683
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
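Note that the shell's "ulimit -a" only reflects the limits of your login shell, not those of the running daemons; the value that matters is the one applied to the TaskTracker JVM itself. A minimal sketch to check it directly (it assumes a single TaskTracker JVM per host and that pgrep is available):
# Read the effective open-file limit of the running TaskTracker process.
TT_PID=$(pgrep -f org.apache.hadoop.mapred.TaskTracker | head -n 1)
grep 'Max open files' /proc/"$TT_PID"/limits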
In Cloudera documentation, we see the description for TaskTracker File Descriptors as:
"The health check thresholds of the number of file descriptors used. A failure of this health check may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support."
We found a similar issue in the issue tracker, but we are not sure whether it refers to the same underlying problem:
https://issues.cloudera.org/browse/DISTRO-530
Is this indeed a bug in the underlying framework, or is there a way to increase the open file descriptor limit through Cloudera Manager? Any other approach to recover from this issue would also be appreciated.
Thanks
Created 01-21-2014 08:30 AM
Hi Ashok, to increase the file descriptor limit for the TaskTracker on a CM-controlled cluster, you can do the following. Keep in mind that the alert is only a warning.
Verify the current file descriptor ulimit value of your TaskTracker process:
pid=$(ps faux | grep java | grep tasktracker | awk -F " " '{print $2}');cat /proc/$pid/limits;
Perform the following on each node in the cluster. A ulimit of 65536 would probably be enough for most use cases; I used 262144 here as an extreme example.
1. Edit /usr/sbin/cmf-agent and change the ulimit -n setting.
2. Edit /usr/lib64/cmf/service/mapreduce/mapreduce.sh and add
ulimit -n 262144
at the top of the script.
3. Push these script changes out to all the slave nodes.
4. A few nodes at a time within the same rack, shut down the DataNode and TaskTracker services on them.
5. On these stopped nodes, run: service cloudera-scm-agent hard_restart
6. Restart the DataNode and TaskTracker.
Verify the file descriptor ulimit value of your TaskTracker process after making the change:
pid=$(ps faux | grep java | grep tasktracker | awk -F " " '{print $2}');cat /proc/$pid/limits;
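As a variant (a rough sketch, not part of the original steps; it assumes pgrep is available and that all Hadoop daemons run classes under org.apache.hadoop), you can check the soft and hard limits of every Hadoop daemon on a node in one pass:
# Print the 'Max open files' soft/hard limits for each running Hadoop daemon.
for pid in $(pgrep -f org.apache.hadoop); do
  printf 'pid %s: ' "$pid"
  awk '/Max open files/ {print $4, $5}' /proc/"$pid"/limits
done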
-roland
Created 01-30-2014 01:21 PM
I am also experiencing large numbers of open connections on my DataNodes.
Using lsof, I see ~23,000 open TCP connections to the DataNode in CLOSE_WAIT state.
Restarting the cluster brings the open file descriptors down to a sane number.
I believe this is being caused by https://issues.apache.org/jira/browse/HDFS-5671
Is Cloudera aware of this? Is there any timeline for when it will be patched into CDH?
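To confirm that the CLOSE_WAIT sockets really belong to the DataNode process, something along these lines can help (a sketch; the DataNode class name and the lsof state filter are assumptions about a typical install):
# Count the DataNode's TCP sockets stuck in CLOSE_WAIT.
DN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode | head -n 1)
sudo lsof -p "$DN_PID" -a -i TCP -s TCP:CLOSE_WAIT | wc -l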
Created 02-04-2014 07:31 AM
What process is responsible for the open sockets? Is it a local JVM, like an HBase regionserver, or some remote network IP? Can you paste a few example lines from the following output?
sudo netstat -anp | grep CLOSE_WAIT
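To see at a glance which process holds most of them, a small sketch (assuming the standard netstat column layout, where column 6 is the state and column 7 is PID/program):
# Count CLOSE_WAIT sockets per owning PID/program name.
sudo netstat -anp | awk '$6 == "CLOSE_WAIT" {print $7}' | sort | uniq -c | sort -rn | head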
Also, what version of CDH are you on? There can be multiple causes for an issue like the one you're seeing, and bugs in this area have been fixed in the past, with workarounds available as well.
Finally, can you give us the "$JAVA_HOME/bin/java -version" output? There was a JVM bug in the ConcurrentMarkSweep GC in versions below 1.6.0_31; the workaround is to add the following JVM property to the runtime settings of the various Hadoop daemons:
-XX:-CMSConcurrentMTEnabled
Newer versions of CM add that for you automatically.
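For a manually managed (non-CM) deployment, one possible place for the flag is the daemon options in hadoop-env.sh; a sketch, assuming the DataNode is the affected daemon:
# hadoop-env.sh: disable multi-threaded concurrent CMS to work around the pre-1.6.0_31 JVM bug.
export HADOOP_DATANODE_OPTS="-XX:-CMSConcurrentMTEnabled ${HADOOP_DATANODE_OPTS}"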
Created 11-01-2016 08:55 AM
Hi... I am facing the same issue on my dev cluster.
It has CM 5.7 and CDH 5.6.
Also it has Java 1.8.
The number of open file descriptors just keeps increasing. After a restart of the machine it comes back down to a saner number.
Note: the only difference between this cluster and the other clusters is that it also has Kafka installed. Could that be the reason?
Any idea why this is happening?
I ran the following on the dev nodes:
[root@hdp-dev01 ~]# /usr/sbin/lsof | sort -k 2 -n | awk ' { num [$1]++ } END { for (a in num) { print a,num[a] } }' | sort -k 2 -rn | head -n 20
cmf-agent 22642
The output of "netstat -anp | grep CLOSE_WAIT" is as below:
tcp 1430 0 10.12.15.114:34176 10.12.15.114:7180 CLOSE_WAIT 16602/python2.6
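Since the top consumer here is cmf-agent (the python2.6 process stuck in CLOSE_WAIT toward the CM server on port 7180), it can help to break that process's descriptors down by type; a sketch, assuming the agent writes its PID to /var/run/cloudera-scm-agent/cloudera-scm-agent.pid:
# Tally the agent's open descriptors by type (REG, IPv4, sock, ...).
AGENT_PID=$(cat /var/run/cloudera-scm-agent/cloudera-scm-agent.pid)
sudo lsof -p "$AGENT_PID" | awk 'NR > 1 {print $5}' | sort | uniq -c | sort -rn | head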
Created on 10-23-2018 02:08 AM - edited 10-23-2018 02:09 AM
Resurrecting this topic with some clarity on the issue and its remedy.
If RegionServers keep dead connections to the DataNodes, the same symptoms are seen: many connections in CLOSE_WAIT and a steadily increasing number of open file descriptors.
In extreme cases the limit can be reached, which causes the host to fail with a "no more open file descriptors" error.
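To check whether a RegionServer is the process holding the leaked sockets, a sketch along these lines can be used (it assumes the default DataNode transfer port 50010 and the standard netstat column layout):
# Count the RegionServer's CLOSE_WAIT sockets toward the DataNode transfer port.
RS_PID=$(pgrep -f org.apache.hadoop.hbase.regionserver.HRegionServer | head -n 1)
sudo netstat -anp | awk -v pid="$RS_PID" '$6 == "CLOSE_WAIT" && $7 ~ ("^" pid "/") && $5 ~ /:50010$/' | wc -l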
There was a bug in HBase prior to CDH 5.13, described in more detail in the upstream JIRA [1]:
HBASE-9393 - Hbase does not closing a closed socket resulting in many CLOSE_WAIT
This issue was patched in the following CDH releases:
CDH5.13.0, CDH5.13.1, CDH5.13.2, CDH5.13.3,
CDH5.14.0, CDH5.14.2, CDH5.14.4,
CDH5.15.0, CDH5.15.1,
CDH6.0.0.
[1] - upstream HBase JIRA - https://issues.apache.org/jira/browse/HBASE-9393?attachmentOrder=asc